Abstract



We theoretically investigate the convergence rate and support consistency (i.e., correctly identifying
the subset of non-zero coefficients in the large sample limit) of multiple kernel learning (MKL). We
focus on MKL with block-$\ell_1$ regularization (inducing sparse kernel combination), block-$\ell_2$
regularization (inducing uniform kernel combination), and elastic-net regularization (including both
block-$\ell_1$ and block-$\ell_2$ regularization). For the case where the true kernel combination is sparse,
we show a sharper convergence rate of the block-$\ell_1$ and elastic-net MKL methods than the existing
rate for block-$\ell_1$ MKL. We further show that elastic-net MKL requires a milder condition for
being consistent than block-$\ell_1$ MKL. For the case where the optimal kernel combination is not
exactly sparse, we prove that elastic-net MKL can achieve a faster convergence rate than the block-$\ell_1$
and block-$\ell_2$ MKL methods by carefully controlling the balance between the block-$\ell_1$ and block-$\ell_2$
regularizers. Thus, our theoretical results overall suggest the use of elastic-net regularization in
MKL.
1 Introduction

The choice of kernel functions is a key issue for kernel methods such as support vector machines to work well
(Vapnik, 1998). A traditional but very powerful approach to optimizing the kernel function is the use of
cross-validation (CV) (Stone, 1974). Although the CV-based kernel choice often leads to better generalization, it
is computationally expensive when the kernel contains multiple tuning parameters.

To overcome this limitation, the framework of multiple kernel learning (MKL) has been introduced, which
tries to learn the optimal linear combination of prefixed base-kernels by convex optimization (Lanckriet et al.,
2004; Micchelli and Pontil, 2005; Lin and Zhang, 2006; Sonnenburg et al., 2006; Rakotomamonjy et al.,
2008; Suzuki and Tomioka, 2009). The seminal paper by Bach et al. (2004) showed that this MKL formulation
can be interpreted as block-$\ell_1$ regularization (i.e., $\ell_1$ regularization across the kernels and $\ell_2$
regularization within the same kernel). We refer to this MKL formulation as 'block-$\ell_1$ MKL'. Based on this
interpretation, block-$\ell_1$ MKL was proved to be support consistent (i.e., correctly identifying the subset of
non-zero coefficients with probability one in the large sample limit) when the true kernel combination is sparse
(Bach, 2008). Furthermore, the convergence rate of block-$\ell_1$ MKL has also been elucidated in Koltchinskii and
Yuan (2008), which can be regarded as an extension of the theoretical analysis for ordinary (non-block) $\ell_1$
regularization (Bickel et al., 2009; Zhang, 2009).

However, in many practical applications, the true kernel combination may not be exactly sparse. In such
a non-sparse situation, block-$\ell_1$ MKL was shown to perform rather poorly; just the uniform combination
of base kernels obtained by block-$\ell_2$ regularization (Micchelli and Pontil, 2005) (which we call 'block-$\ell_2$
MKL') often works better in practice (Cortes, 2009). Furthermore, recent works showed that some 'intermediate'
regularization between block-$\ell_1$ and block-$\ell_2$ regularization is more promising, e.g., block-$\ell_p$
regularization with $1 < p < 2$ (Cortes et al., 2009; Kloft et al., 2009), and elastic-net regularization (Zou and Hastie,
2005), which includes both block-$\ell_1$ and block-$\ell_2$ regularization (Tomioka and Suzuki, 2010) (we call this
method 'elastic-net MKL'). Theoretically, the support consistency and the convergence rate for parametric
elastic-nets have been elucidated in Yuan and Lin (2007) and Zou and Zhang (2009), respectively, and that
for non-parametric cases has been investigated in Meier et al. (2009), focusing on the Sobolev space.

In this paper, we theoretically analyze the support consistency and convergence rate of MKL, and provide 
three new results. 



• For the case where the true kernel combination is sparse, we show that elastic-net MKL achieves a faster
convergence rate than the one shown for block-$\ell_1$ MKL (Koltchinskii and Yuan, 2008). More specifically,
we show that the $L_2$ convergence error is given by
$O_p\bigl(\min\{d n^{-\frac{2}{2+s}} + d\log(M)/n,\ d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + d\log(M)/n\}\bigr)$,
where $d$ is the number of active components of the target function, $s$ is the complexity
of RKHSs, $M$ is the number of candidate kernels, and $n$ is the number of samples.

• For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL
achieves a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling
the balance between block-$\ell_1$ and block-$\ell_2$ regularization. Our theoretical result agrees well with the
experimental results reported in Tomioka and Suzuki (2010).

• For the case where the true kernel combination is sparse, we prove that the necessary and sufficient
conditions for the support consistency of elastic-net MKL are milder than the conditions required for
block-$\ell_1$ MKL (Bach, 2008).



Overall, our theoretical results suggest the use of elastic-net regularization in MKL. 

2 Preliminaries 

In this section, we formulate the elastic-net MKL approach and summarize mathematical tools that are needed 
for the theoretical analysis. 

2.1 Formulation 

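Suppose we are given $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i$ belongs to an input space $\mathcal{X}$ and $y_i \in \mathbb{R}$.
$\{(x_i, y_i)\}_{i=1}^{n}$ are independent and identically distributed from a probability measure $P$. We denote the
marginal distribution of $X$ by $\Pi$. We consider an MKL regression problem in which the unknown target function is
represented as $f(x) = \sum_{m=1}^{M} f_m(x)$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$ $(m = 1, \dots, M)$
corresponding to $M$ different base kernels $k_m$ over $\mathcal{X} \times \mathcal{X}$.
Elastic-net MKL learns a decision function $\hat{f}$ as$^1$

$$\hat{f} = \mathop{\mathrm{arg\,min}}_{f_m \in \mathcal{H}_m\ (m=1,\dots,M)} \frac{1}{n}\sum_{i=1}^{n}\Bigl(y_i - \sum_{m=1}^{M} f_m(x_i)\Bigr)^2 + \lambda_1^{(n)}\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m=1}^{M}\|f_m\|_{\mathcal{H}_m}^2, \tag{1}$$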



where the first term is the squared loss of function fitting, and the second and the third terms are block-$\ell_1$
and block-$\ell_2$ regularizers, respectively. It can be seen from (1) that elastic-net MKL is reduced to block-$\ell_1$
MKL if $\lambda_2^{(n)} = 0$, which tends to induce sparse kernel combination (Lanckriet et al., 2004; Bach et al., 2004).
On the other hand, it is reduced to block-$\ell_2$ MKL if $\lambda_1^{(n)} = 0$, which results in uniform kernel combination
(Micchelli and Pontil, 2005). It is worth noting that elastic-net MKL allows us to obtain various levels of
sparsity by controlling the ratio between $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$.

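As a concrete illustration of (1), the following minimal Python sketch evaluates the elastic-net MKL objective on training data. It assumes, purely for illustration, that each component is written in the finite form $f_m = \sum_i \alpha_{m,i} k_m(\cdot, x_i)$, so that $\|f_m\|_{\mathcal{H}_m}^2 = \alpha_m^\top K_m \alpha_m$ with $K_m$ the Gram matrix of $k_m$; the function name and argument layout are hypothetical and not part of the original formulation.

```python
import numpy as np

def elastic_net_mkl_objective(alphas, kernels, y, lam1, lam2):
    """Evaluate objective (1) for f_m = sum_i alphas[m][i] * k_m(., x_i).

    alphas  : list of M coefficient vectors of length n (illustrative parameterization)
    kernels : list of M Gram matrices of shape (n, n)
    y       : response vector of length n
    lam1    : weight of the block-l1 term, sum_m ||f_m||_{H_m}
    lam2    : weight of the block-l2 term, sum_m ||f_m||_{H_m}^2
    """
    fitted = np.zeros(y.shape[0])
    block_l1 = 0.0
    block_l2 = 0.0
    for alpha, K in zip(alphas, kernels):
        fitted += K @ alpha                           # f_m at the training inputs
        sq_norm = max(float(alpha @ K @ alpha), 0.0)  # ||f_m||^2, clipped against round-off
        block_l1 += np.sqrt(sq_norm)
        block_l2 += sq_norm
    squared_loss = np.mean((y - fitted) ** 2)
    return squared_loss + lam1 * block_l1 + lam2 * block_l2
```

Setting `lam2 = 0` in this sketch gives the block-$\ell_1$ objective and setting `lam1 = 0` gives the block-$\ell_2$ objective, mirroring the two special cases described above.
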
2.2 Notations and Assumptions 

Here, we prepare technical tools needed in the following sections. 

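Due to Mercer's theorem, there are an orthonormal system $\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and the spectrum
$\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the following spectral representation:

$$k_m(x, x') = \sum_{k=1}^{\infty} \mu_{k,m}\, \phi_{k,m}(x)\, \phi_{k,m}(x'). \tag{2}$$

By this spectral representation, the inner product of RKHS can be expressed as
$\langle f_m, g_m\rangle_{\mathcal{H}_m} = \sum_{k=1}^{\infty} \mu_{k,m}^{-1} \langle f_m, \phi_{k,m}\rangle_{L_2(\Pi)} \langle \phi_{k,m}, g_m\rangle_{L_2(\Pi)}$.

Let $\mathcal{H} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$. For $f = (f_1, \dots, f_M) \in \mathcal{H}$ and a subset of indices $I \subseteq \{1, \dots, M\}$, we
denote by $f_I$ the restriction of $f$ to an index set $I$, i.e., $f_I = (f_m)_{m \in I}$.
We denote by $I_0$ the indices of truly active kernels, i.e.,

$$I_0 := \{m \mid \|f_m^*\|_{\mathcal{H}_m} > 0\},$$

and define the complement of $I_0$ as $J_0 = I_0^c$.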

Throughout the paper, we assume the following technical conditions (see also Bach (2008)).



$^1$ For simplicity, we focus on the squared-loss function here. However, we note that it is straightforward to extend our
convergence analysis and support consistency results given in Sections 3 and 4 to general loss functions that are strongly
convex and Lipschitz continuous, by following the line of Koltchinskii and Yuan (2008).

Table 1: Summary of the constants we use in this article.

M         The number of candidate kernels.
d         The number of active kernels of the truth; i.e., $d = |I_0|$.
R         The upper bound of $\sum_{m=1}^{M}(\|f_m^*\|_{\mathcal{H}_m} + \|g_m^*\|_{\mathcal{H}_m})$; see (A4).
s         The spectral decay coefficient; see (A5).
$\beta$   The approximate sparsity coefficient; see (A7).
b         The parameter that tunes the correlation between kernels; see (A8).



Assumption 1 (Basic Assumptions)

(A1) There exists $f^* = (f_1^*, \dots, f_M^*) \in \mathcal{H}$ such that $E[Y|X] = \sum_{m=1}^{M} f_m^*(X)$, and the noise
$\epsilon := Y - f^*(X)$ has a strictly positive variance; there exists $\sigma > 0$ such that $E[\epsilon^2|X] > \sigma^2$ for all
$X \in \mathcal{X}$. We also assume that $\epsilon$ is bounded as $|\epsilon| \le L$.

(A2) For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable and $\sup_{X \in \mathcal{X}} |k_m(X, X)| \le 1$.

(A3) There exists $g_m^* \in \mathcal{H}_m$ such that

$$f_m^*(x) = \int_{\mathcal{X}} k_m^{(1/2)}(x, x')\, g_m^*(x')\, d\Pi(x') \quad (\forall m = 1, \dots, M), \tag{3}$$

where $k_m^{(1/2)}(x, x') = \sum_{k=1}^{\infty} \mu_{k,m}^{1/2} \phi_{k,m}(x) \phi_{k,m}(x')$ is the operator square-root of $k_m$.

The first assumption in (A1) ensures the model $\mathcal{H}$ is correctly specified, and the technical assumption $|\epsilon| \le L$
allows $\epsilon f$ to be Lipschitz continuous with respect to $f$.

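It is known that the assumption (A2) gives the following relation:

$$\|f_m\|_{\infty} \le \sup_{x} \|k_m(x, \cdot)\|_{\mathcal{H}_m} \|f_m\|_{\mathcal{H}_m} \le \sup_{x} \sqrt{k_m(x, x)}\, \|f_m\|_{\mathcal{H}_m} \le \|f_m\|_{\mathcal{H}_m}.$$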
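
The relation implied by (A2) follows from the reproducing property; the short derivation below is a sketch of the standard argument (it is not taken from the original text), written for a fixed $x \in \mathcal{X}$.

```latex
\begin{align*}
|f_m(x)|
  &= \bigl|\langle f_m, k_m(x,\cdot)\rangle_{\mathcal{H}_m}\bigr|
  && \text{(reproducing property)} \\
  &\le \|k_m(x,\cdot)\|_{\mathcal{H}_m}\,\|f_m\|_{\mathcal{H}_m}
  && \text{(Cauchy--Schwarz)} \\
  &= \sqrt{k_m(x,x)}\,\|f_m\|_{\mathcal{H}_m}
  && \text{(since } \|k_m(x,\cdot)\|_{\mathcal{H}_m}^2 = k_m(x,x)\text{)} \\
  &\le \|f_m\|_{\mathcal{H}_m}
  && \text{(by (A2))}.
\end{align*}
```

Taking the supremum over $x$ gives the sup-norm bound.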

The assumption (A3) was used in Caponnetto and de Vito (2007) and also in Bach (2008). It ensures the
consistency of the least-squares estimates in terms of the RKHS norm. Using the spectral representation (2),
the condition $g_m^* \in \mathcal{H}_m$ is expressed as

$$\|g_m^*\|_{\mathcal{H}_m}^2 = \sum_{k=1}^{\infty} \mu_{k,m}^{-2} \langle f_m^*, \phi_{k,m}\rangle_{L_2(\Pi)}^2 < \infty. \tag{4}$$

This condition was also assumed in Koltchinskii and Yuan (2008). Proposition 9 of Bach (2008) gave a
sufficient condition to fulfill (4) for translation invariant kernels $k_m(x, x') = h_m(x - x')$.
Constants we use later are summarized in Table 1.
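
The spectral form of the condition $g_m^* \in \mathcal{H}_m$ can be checked directly from (2) and (3). The following is a sketch of that computation, assuming $\mu_{k,m} > 0$ for every index $k$ that appears in the sums.

```latex
\begin{align*}
\langle f_m^*, \phi_{k,m}\rangle_{L_2(\Pi)}
  &= \mu_{k,m}^{1/2}\,\langle g_m^*, \phi_{k,m}\rangle_{L_2(\Pi)}
  && \text{(from (3) and orthonormality of } \{\phi_{k,m}\}_k\text{)}, \\
\|g_m^*\|_{\mathcal{H}_m}^2
  &= \sum_{k=1}^{\infty} \mu_{k,m}^{-1}\,\langle g_m^*, \phi_{k,m}\rangle_{L_2(\Pi)}^2
   = \sum_{k=1}^{\infty} \mu_{k,m}^{-2}\,\langle f_m^*, \phi_{k,m}\rangle_{L_2(\Pi)}^2 .
\end{align*}
```

In words, (A3) asks the coefficients of $f_m^*$ to decay faster than membership in $\mathcal{H}_m$ alone would require, by one extra factor of $\mu_{k,m}^{-1}$ in the weighted sum.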

3 Convergence Rate of Elastic-net MKL 

In this section, we derive the convergence rate of elastic-net MKL in two situations:

(i) A sparse situation where the truth $f^*$ is sparse (Section 3.1).

(ii) A near sparse situation where the truth is not exactly sparse, but $\|f_m^*\|_{\mathcal{H}_m}$ decays polynomially as $m$
increases (Section 3.2).

For (i), we show that elastic-net MKL (and block-$\ell_1$ MKL) achieves a faster convergence rate than the rate
shown for block-$\ell_1$ MKL (Koltchinskii and Yuan, 2008). Furthermore, for (ii), we show that elastic-net MKL
can outperform block-$\ell_1$ MKL and block-$\ell_2$ MKL depending on the sparsity of the truth and the condition of
the problem. Throughout this section, we assume the following conditions.

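Assumption 2 (Boundedness Assumption) There exist constants $C_1$ and $R$ such that

$$\text{(A4)} \quad \max_{m \in I_0} \frac{\|g_m^*\|_{\mathcal{H}_m}}{\|f_m^*\|_{\mathcal{H}_m}} \le C_1, \qquad \sum_{m=1}^{M} \bigl(\|f_m^*\|_{\mathcal{H}_m} + \|g_m^*\|_{\mathcal{H}_m}\bigr) \le R.$$

Assumption 3 (Spectral Assumption) There exist $0 < s < 1$ and $C_2$ such that

$$\text{(A5)} \quad \mu_{k,m} \le C_2 k^{-\frac{1}{s}}, \quad (1 \le \forall k,\ 1 \le \forall m \le M),$$

where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (2)).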
The first assumption in (A4) appeared in Theorem 2 of Koltchinskii and Yuan (2008). The second assumption
in (A4) bounds the amplitude of $f^*$. It was shown that the spectral assumption (A5) is equivalent
to the classical covering number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number
$\mathcal{N}(\epsilon, \mathcal{B}_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover
the unit ball $\mathcal{B}_{\mathcal{H}_m}$ in $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A5) holds, there
exists a constant $c$ that depends only on $s$ such that

$$\log \mathcal{N}(\epsilon, \mathcal{B}_{\mathcal{H}_m}, L_2(\Pi)) \le c\, \epsilon^{-2s}, \tag{5}$$

and the converse is also true (see Theorem 15 of Steinwart et al. (2009) and Steinwart (2008) for details).
Therefore, if $s$ is large, at least one RKHS is "complex", and if $s$ is small, the RKHSs are regarded as
"simple".

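For a given set of indices $I \subseteq \{1, \dots, M\}$, let $\kappa(I)$ be defined as follows:

$$\kappa(I) := \sup\Bigl\{\kappa \ge 0 \Bigm| \kappa \le \frac{\bigl\|\sum_{m \in I} f_m\bigr\|_{L_2(\Pi)}^2}{\sum_{m \in I} \|f_m\|_{L_2(\Pi)}^2},\ \forall f_m \in \mathcal{H}_m\ (m \in I)\Bigr\}.$$

$\kappa(I)$ represents the correlation of RKHSs inside the indices $I$. Similarly, we define the correlations of RKHSs
between $I$ and $I^c$ as follows:

$$\rho(I) := \sup\Bigl\{\frac{\langle f_I, g_{I^c}\rangle_{L_2(\Pi)}}{\|f_I\|_{L_2(\Pi)}\, \|g_{I^c}\|_{L_2(\Pi)}} \Bigm| f_I \in \mathcal{H}_I,\ g_{I^c} \in \mathcal{H}_{I^c},\ f_I \ne 0,\ g_{I^c} \ne 0\Bigr\}.$$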

In Subsections 3.1 and 3.2 we will assume that the kernels have no perfect canonical dependence, implying
that the kernels are not similar to each other (see (A6) and (A8) below).

Throughout this paper, we assume $\log(M)/\sqrt{n} \le 1$ and that $\log(M)$ grows slower than any polynomial order in
the number of samples $n$: $\log(M) = o(n^{\epsilon})$ for all $\epsilon > 0$. With some abuse of notation, we use $C$ to denote
constants that are independent of $d$ and $n$; its value may differ from line to line.

3.1 Sparse Situation 

Here we derive the convergence rate of the estimator $\hat{f}$ when the truth $f^*$ is sparse. Let $d = |I_0|$ and suppose
that the number of kernels $M$ and the number of active kernels $d$ are increasing with respect to the number of
samples $n$. We further assume the following condition in this subsection.

Assumption 4 (Incoherence Assumption) There exists a constant $C_3 > 0$ such that

$$\text{(A6)} \quad 0 < C_3^{-1} < \kappa(I_0)\bigl(1 - \rho^2(I_0)\bigr). \tag{6}$$

This condition is known as the incoherence condition (Koltchinskii and Yuan, 2008; Meier et al., 2009), i.e.,
kernels are not too dependent on each other and the problem is well conditioned. Then we have the following
convergence rate.

Theorem 1 Under assumptions (A1-A6), there exist constants $C$, $F$ and $K$ depending only on $\kappa(I_0)$, $\rho(I_0)$,
$s$, $C_1$, $C_2$, $L$, and $R$ such that the $L_2(\Pi)$-norm of the residual $\hat{f} - f^*$ can be bounded as follows: when



d3+sn~^ <l,forX["^ -A^"' ^ miix{Kn-^ + K2^ i^, F^'-^^^^^}, 

\\f-.n\Uu^<c[dn-^ + ^), (7) 
and, when d^+'v,-^ > I, for A^"^ = max{/^(l + Vt)n-"^, ^ /iog(^| y(n) ^ 



/-,rlli..n.<CWfe?n-A + -'"°«'"'" + " |, ,8, 



n 



where each inequality holds with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$.

The above theorem indicates that the learning rate depends on the complexity of RKHSs (the simpler,
the faster) and the number of active kernels rather than the number of kernels $M$ (the influence of $M$
is at most $d\log(M)/n$). It is worth noting that the convergence rate in (7) and (8) is faster than or equal
to the rate of block-$\ell_1$ MKL shown by Koltchinskii and Yuan (2008), which established the learning rate
$O_p\bigl(d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + d\log(M)/n\bigr)$ under the same conditions as ours$^2$.
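
To get a feel for when each bound is the smaller one, the following Python sketch evaluates the two leading-order rates numerically. It assumes the rates take the forms $d n^{-2/(2+s)} + d\log(M)/n$ and $d^{(1-s)/(1+s)} n^{-1/(1+s)} + d\log(M)/n$ with all constants set to one; the function names are hypothetical, and the numbers only indicate orders of magnitude, not actual error levels.

```python
import math

def rate_small_d(d, n, M, s):
    """Leading-order rate d * n^(-2/(2+s)) + d*log(M)/n (constants set to 1)."""
    return d * n ** (-2.0 / (2.0 + s)) + d * math.log(M) / n

def rate_large_d(d, n, M, s):
    """Leading-order rate d^((1-s)/(1+s)) * n^(-1/(1+s)) + d*log(M)/n (constants set to 1)."""
    return d ** ((1.0 - s) / (1.0 + s)) * n ** (-1.0 / (1.0 + s)) + d * math.log(M) / n

def best_rate(d, n, M, s):
    """Minimum of the two rates, as in the min{...} expression of the sparse case."""
    return min(rate_small_d(d, n, M, s), rate_large_d(d, n, M, s))

if __name__ == "__main__":
    # Few active kernels relative to n: the first rate is the smaller one.
    print(rate_small_d(2, 10**6, 100, 0.5), rate_large_d(2, 10**6, 100, 0.5))
    # Many active kernels relative to n: the second rate takes over.
    print(rate_small_d(1000, 10**6, 100, 0.5), rate_large_d(1000, 10**6, 100, 0.5))
```

Under these assumed forms, the first rate is the smaller of the two exactly when $d^{2(2+s)} \le n$, which is why the comparison flips as $d$ grows for fixed $n$.
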

^In our second bound ([§), there is the additional term. However this can be eliminated by replacing the 

probability 1 — e~* — n^^ with 1 — e~' — as in lKoltchinskii and YuanI 120081 '). Moreover, if log(n)~^ > d, 

then the term is dominated by the first term d 1+= n~ 1+= . 
3.2 Near-Sparse Situation

In this subsection, we analyze the convergence rate under a situation where $f^*$ is not sparse but near sparse.
We have shown a faster learning rate than existing bounds in the previous subsection. However, the assumptions
we used might be too restrictive to capture the situation where MKL is used in practice. In fact, it was
pointed out in Zou and Hastie (2005) in the context of (non-block) $\ell_1$ regularization that $\ell_1$ regularization
could fail in the following situations:

• When the truth $f^*$ is not sparse, the $\ell_1$ regularization shrinks many small but non-zero components to
zero.

• When there exist strong correlations between different kernels, the solution of block-$\ell_1$ MKL becomes
unstable.

• When the number of kernels $M$ is not large, there is no need to impose the estimator to be sparse.

In order to analyze these situations in the MKL setting, we introduce three parameters $\beta$, $b$, and $\tau$: $\beta$
controls the level of sparsity (see (A7)), $b$ controls the correlation between candidate kernels (see (A8)), and
$\tau$ controls the growth of the number of kernels against the number of samples (see (A9)).

We show that naturally block-$\ell_2$ MKL is preferable when there are only few candidate kernels or the
truth is dense. Importantly, if the candidate kernels are correlated, the convergence of block-$\ell_1$ MKL can
be slow even when the truth is sparse. Our analysis shows that elastic-net MKL is most valuable in such an
intermediate situation.

By permuting indices, we can assume without loss of generality that $\|f_m^*\|_{\mathcal{H}_m}$ is decreasing with respect to
$m$, i.e., $\|f_1^*\|_{\mathcal{H}_1} \ge \|f_2^*\|_{\mathcal{H}_2} \ge \|f_3^*\|_{\mathcal{H}_3} \ge \cdots$. We further assume the following conditions in this subsection.

Assumption 5 (Approximate Sparsity) The truth is approximately sparse, i.e., $\|f_m^*\|_{\mathcal{H}_m} > 0$ for all $m$ and
thus $I_0 = \{1, \dots, M\}$. However, $\|f_m^*\|_{\mathcal{H}_m}$ decays polynomially with respect to $m$ as follows:

$$\text{(A7)} \quad \|f_m^*\|_{\mathcal{H}_m} \le C_3 m^{-\beta}.$$

We call $\beta\ (> 1)$ the approximate sparsity coefficient.

Assumption 6 (Generalized Incoherence) There exist $b > 0$ and $C_4$ such that for all $I \subseteq \{1, \dots, M\}$,

$$\text{(A8)} \quad \bigl(1 - \rho^2(I)\bigr)\kappa(I) \ge C_4 |I|^{-b}.$$

Assumption 7 (Kernel-Set Growth) The number of kernels $M$ is increasing polynomially with respect to
the number of samples $n$, i.e., $\exists \tau > 0$ such that

$$\text{(A9)} \quad M = \lceil n^{\tau} \rceil.$$

For notational convenience, let n = T2 = (^7+6K2+",;l"^i'i, , ^3 = ^liStf+ffl-^ ' 

T4 = r5 = (ff+b){6(2+^)+2} ^"'^ = (i-s)(i+b) ■ ^" addition, we denote by K some sufficiently large 
constant. 

Theorem 2 Suppose assumptions (A1-A5) and (A7-A9), $2\beta(1 - s) < s(b - 1)$, and $\tau_1 < \tau < \tau_4$ are
satisfied. Then the estimator $\hat{f}$ of elastic-net MKL possesses the following convergence rates, each of which
holds with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$:
1. When $\tau_1 < \tau < \tau_2$,

(2g + t.)(2 + s)-3-B + 23 , ^2, 



,,„ f , (2g + t.)(2 + s)-3-B + 23 CjI^, , r- ,1 

||/-rilL(n) <c{n"^^ + (n-^TW+WiT-T^ + a(") )(Vt + <)}, 



with A^"^ = max{Xn"WTO(2+iHT^ + f^^^ L^p ^^-^^i^} and A^"^ = (2/3+b)(2+»)-i-» _ 
2. When $\tau_2 < \tau < \tau_3$,

-V ^ r (2 + s)b + 2 7-(2 + s){l-^)-(4^ + 2&+sb-2) / ■, 2 ^ ^ 

II/- /*lli2(n) <C'|n^5T(2+7)(E+]j)^-72 ^ 2{(/3+i>)(2+.,-=} +x^2 ){Vt + t)j, 
with A^"^ = max{i^yf + K2^, F^/^SlM} and A^"^ = Kri -l^Z^>(^tl]-li ,
3. When $\tau_3 < \tau < \tau_4$,

„ / T(/3-l) + l-2g-b f s2 r- \ 

11/ - rilL(n) <c(n^^-^^ + (n ^c+z^) + A^") + 

where 7q = — ^ — , (11) 

" 2(6 + /3) ' ^ ' 



with A^"^ = max{i^y^ + i^a^^, F^^^^} and A^"^ = X(A//n)^^SI?r . 

Theorem 3 Under assumptions (A1-A5) and (A7-A9), if < t, the estimator fp^ ofblock-ii MKL has the 
following convergence rate with probability at least 1 — e^* — for all t > log log(i?Y^) + log M: 

{block-£, MKL) II/,, - /*||i,(n) < C (n-^^ + [Vt + t)) , 

/ 2/3 + 6-1 

^4 = + (12) 



with A^"^ = max{A'n 2+s -j- y Fy | q^^/ ^^") — q. Moreover, if t < rg, f/ie estimator 

jfi^ of block- £2 MKL has the following convergence rate with probability at least 1 — e^* — for all 
t > \oglog{Ry/n) + logM; 



{block-£2 MKL) 



114 - /1li.(n) < + (a'"'' + ^) t 



2 

w/iere 75 = — — , (13) 

2 + s 



wif/i A^"^ = max{i^(f fl„^A[") = 0. 



In all convergence rates presented in Theorems 2 and 3, the leading terms are the terms that do not contain
$t$. The convergence order of the terms containing $t$ is faster than that of the leading terms, and thus negligible.

By simple calculation, we can confirm that elastic-net MKL always converges faster than block-$\ell_1$ MKL
and block-$\ell_2$ MKL if $\beta$ and $M$ satisfy the condition of Theorem 2. The convergence rate of elastic-net MKL
becomes identical with block-$\ell_2$ MKL and block-$\ell_1$ MKL at the two extreme points of the interval $\tau = \tau_1$
and $\tau_4$, respectively. Outside the region, block-$\ell_1$ MKL or block-$\ell_2$ MKL has a faster convergence rate than
elastic-net MKL. Moreover, at $\tau = \tau_2$, the convergence rates (9) and (10) of elastic-net MKL are identical,
and at $\tau = \tau_3$, the convergence rates (10) and (11) are identical. The relation between the most preferred
method and the growth rate $\tau$ of the number of kernels is illustrated in Figure 1.

The condition $\tau_1 < \tau < \tau_4$ in Theorem 2 indicates that when the number of kernels is not too small or too
large, an 'intermediate' effect of elastic-net MKL becomes advantageous. Roughly speaking, if $M$ is large,
sparsity is needed to ensure the convergence and thus block-$\ell_1$ MKL performs the best. On the other hand,
if $M$ is small, there is no need to make the solution sparse and thus block-$\ell_2$ MKL becomes the best. For an
intermediate $M$, elastic-net MKL becomes the best.

The condition $2\beta(1 - s) < s(b - 1)$ in Theorem 2 ensures the existence of $M$ that satisfies the condition
in the theorem, i.e., $\tau_1 < \tau_2 < \tau_3 < \tau_4$. It can be seen that as $b$ becomes large (the condition of the problem
becomes worse), the range of $\beta$ and $M$ in which elastic-net MKL performs better than block-$\ell_1$ MKL and
block-$\ell_2$ MKL becomes large. This indicates that the worse the condition of the problem becomes, the more
important it is to control the balance of $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$ appropriately.



4 Support Consistency of Elastic-net MKL 

In this section, we derive necessary and sufficient conditions for the statistical support consistency of the
estimated sparsity pattern, i.e., the probability of $\{m \mid \|\hat{f}_m\|_{\mathcal{H}_m} \ne 0\} = I_0$ goes to 1 as the number of
samples $n$ tends to infinity. Due to the additional squared regularization term, the necessary condition for the
support consistency of elastic-net MKL is shown to be weaker than that for block-$\ell_1$ MKL (Bach, 2008). In
this section, we assume $M$ and $d = |I_0|$ are fixed against the number of samples $n$.

Let $\mathcal{H}_I$ be the restriction of $\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$ to the index set $I$. Since $E_X[k_m(X, X)] < \infty$ for all $m$ (from
assumption (A2)), we define the (non-centered) cross covariance operator $\Sigma_{I,J} : \mathcal{H}_J \to \mathcal{H}_I$ as a bounded

Figure 1: Relation between the convergence rate and the number of kernels (horizontal axis: growth rate $\tau$ of
the number of kernels). If the truth is intermediately sparse (the growth rate $\tau$ of the number of kernels is
between $\tau_1$ and $\tau_4$), then elastic-net MKL performs best. At the edge of the interval, the convergence rate of
elastic-net MKL coincides with that of block-$\ell_1$ MKL or block-$\ell_2$ MKL.



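linear operator such that$^3$

$$\langle f_I, \Sigma_{I,J}\, g_J\rangle_{\mathcal{H}_I} = \sum_{m \in I}\sum_{m' \in J} \langle f_m, \Sigma_{m,m'}\, g_{m'}\rangle_{\mathcal{H}_m} = \sum_{m \in I}\sum_{m' \in J} E_X[f_m(X)\, g_{m'}(X)], \tag{14}$$

for all $f_I = (f_m)_{m \in I} \in \mathcal{H}_I$ and $g_J = (g_{m'})_{m' \in J} \in \mathcal{H}_J$. See Baker (1973) for the details of the cross
covariance operator $(f, g) \mapsto \mathrm{cov}(f(X), g(X))$.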

Moreover, we define the bounded (non-centered) cross-correlation operator$^4$ $V_{l,m}$ by
$\Sigma_{l,l}^{1/2} V_{l,m} \Sigma_{m,m}^{1/2} = \Sigma_{l,m}$. The joint cross-correlation operator $V_{I,J} : \mathcal{H}_J \to \mathcal{H}_I$ is defined analogously to $\Sigma_{I,J}$.
In this section, we assume in addition to the basic assumptions (A1-A3) that

(A10) All $V_{l,m}$ are compact and the joint correlation operator $V$ is invertible.

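Let $\hat{I}$ be the indices of active kernels for the estimated $\hat{f} \in \mathcal{H}$ by elastic-net MKL: $\hat{I} := \{m \mid \|\hat{f}_m\|_{\mathcal{H}_m} > 0\}$.
Let $D := \mathrm{Diag}(\|f_m^*\|_{\mathcal{H}_m}^{-1}) = \mathrm{Diag}\bigl((\|f_m^*\|_{\mathcal{H}_m}^{-1} I_{\mathcal{H}_m})_{m \in I_0}\bigr)$, where Diag is the $|I_0| \times |I_0|$ block-diagonal
operator with operators $\|f_m^*\|_{\mathcal{H}_m}^{-1} I_{\mathcal{H}_m}$ on diagonal blocks for $m \in I_0$. In this section, we assume that the true
sparsity pattern $I_0$ and the number of kernels $M$ are fixed independently of the number of samples $n$.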

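The norm of $f \in \mathcal{H}$ is defined by $\|f\|_{\mathcal{H}} := \sqrt{\sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^2}$; similarly, that of $f_I \in \mathcal{H}_I$ is
defined by $\|f_I\|_{\mathcal{H}_I} := \sqrt{\sum_{m \in I} \|f_m\|_{\mathcal{H}_m}^2}$. The following theorem gives a sufficient condition for the support
consistency of sparsity patterns.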



Theorem 4 Suppose $\lambda_2^{(n)} > 0$, $\lambda_1^{(n)} \to 0$, $\lambda_2^{(n)} \to 0$, $\lambda_1^{(n)}\sqrt{n} \to \infty$, and



lim sup„ 



<1, (VmeJ = /o"). (15) 



Then$^5$, under assumptions (A1-A3, A10), $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$ and $\hat{I} \xrightarrow{p} I_0$.

The condition $\lambda_2^{(n)} > 0$ is just for technical simplicity to let $\Sigma_{I_0,I_0} + \lambda_2^{(n)}$ be invertible. The condition
$\lambda_1^{(n)}\sqrt{n} \to \infty$ means that $\lambda_1^{(n)}$ does not decrease too quickly. The condition (15) corresponds to an
infinite-dimensional extension of the elastic-net 'irrepresentable' condition. In the paper of Zhao and Yu (2006), the
irrepresentable condition was derived as a necessary and sufficient condition for the sign consistency of $\ell_1$
regularization when the number of parameters is finite. Its elastic-net version was derived in Yuan and Lin
(2007), and it was extended to a situation where the number of parameters diverges as $n$ increases (Jia and
Yu, 2010).

We also have a necessary condition for consistency. 



$^3$ If one fits a function with a constant offset ($f(x) + b$ instead of $f(x)$) as in Bach (2008), then the centered version
of cross covariance operator is required instead of the non-centered version, i.e.,
$\langle f_m, \Sigma_{m,m'}\, g_{m'}\rangle_{\mathcal{H}_m} = E_X[(f_m(X) - E_X[f_m])(g_{m'}(X) - E_X[g_{m'}])]$. However, this difference is not essential because,
without loss of generality, one can consider a situation where $E_Y[Y] = 0$ and $E_X[f_m(X)] = 0$ for all $f_m \in \mathcal{H}_m$ by
centering all the functions.
$^4$ Actually, such a bounded operator always exists (Baker, 1973).

$^5$ For random variables $x_n$ and $y$, $x_n \xrightarrow{p} y$ means the convergence in probability, i.e., the probability of
$|x_n - y| > \epsilon$ goes to 0 for all $\epsilon > 0$ as the number of samples $n$ tends to infinity.
Theorem 5 If $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$ and $\hat{I} \xrightarrow{p} I_0$, then under assumptions (A1-A3, A10), there exist sequences
$\lambda_1^{(n)}, \lambda_2^{(n)}$ such that



lim sup^ 



lo 



< 1, 



(Vm eJ^I^). 



(16) 



Moreover, such $\lambda_1^{(n)}$ satisfies $\lambda_1^{(n)}\sqrt{n} \to \infty$.



The sufficient condition (15) contains the strict inequality ('$<$'), while similar conditions for ordinary
(non-block) $\ell_1$ regularization or ordinary (non-block) elastic-net regularization contain the weak inequality
('$\le$'). The strict inequality appears because each block contains multiple variables in group lasso and MKL
(Bach, 2008).

The condition $\lambda_1^{(n)}\sqrt{n} \to \infty$ is necessary to impose the RKHS-norm convergence $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$.
Roughly speaking, this means that the block-$\ell_1$ regularization term should be stronger than the noise level to
suppress fluctuations by noise.

It is worth noting that the conditions (15) and (16) are weaker than the condition for block-$\ell_1$ MKL
presented in Bach (2008): the block-$\ell_1$ MKL irrepresentable condition is$^6$



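$$\begin{cases}
\text{(Sufficient condition)} & \bigl\|\Sigma_{m,m}^{1/2} V_{m,I_0} V_{I_0,I_0}^{-1} D\, g_{I_0}^*\bigr\|_{\mathcal{H}_m} < 1, \quad (\forall m \in J), \\
\text{(Necessary condition)} & \bigl\|\Sigma_{m,m}^{1/2} V_{m,I_0} V_{I_0,I_0}^{-1} D\, g_{I_0}^*\bigr\|_{\mathcal{H}_m} \le 1, \quad (\forall m \in J).
\end{cases} \tag{17}$$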



This is because the group-$\ell_2$ regularization term eases the singularity of the problem. Examples in which
elastic-nets successfully estimate the true sparsity pattern while $\ell_1$ regularization fails in parametric situations can
be found in Jia and Yu (2010).



5 Conclusions 

We provided three novel theoretical results on the support consistency and convergence rate of elastic-net 
MKL. 

(i) Elastic-net MKL was shown to be support consistent under a milder condition than block-$\ell_1$ MKL.

(ii) A tighter convergence rate than existing bounds was derived for the situation where the truth is sparse.

(iii) The convergence rates of block-$\ell_1$ MKL, elastic-net MKL, and block-$\ell_2$ MKL when the truth is near
sparse were elucidated, and elastic-net MKL was shown to perform better when the decrease rate $\beta$ is
not large, or the condition of the problem is bad.

Based on our theoretical findings, we conclude that the use of elastic-net regularization is recommended for 
MKL. 

Elastic-net MKL can be regarded as 'intermediate' between block-$\ell_1$ MKL and block-$\ell_2$ MKL. Another
popular intermediate variant is block-$\ell_p$ MKL for $1 < p < 2$ (Kloft et al., 2009; Cortes et al., 2009). Elastic-net
MKL and block-$\ell_p$ MKL are conceptually similar, but they have a notable difference: elastic-net MKL
with $\lambda_1^{(n)} > 0$ tends to produce sparse solutions, while block-$\ell_p$ MKL with $1 < p < 2$ always produces dense
solutions (i.e., all combination coefficients of kernels are non-zero). Sparsity of elastic-net MKL would be
advantageous when the true kernel combination is sparse, as we proved in this paper. However, when the true
kernel combination is non-sparse, the difference/relation between elastic-net MKL and block-$\ell_p$ MKL is not
clear yet. This needs to be further investigated in future work.
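
The sparsity-inducing effect of the block-$\ell_1$ part of the elastic-net penalty can be seen in a simple finite-dimensional analogue. The Python sketch below computes the proximal operator of $v \mapsto \lambda_1\|v\|_2 + \lambda_2\|v\|_2^2$ for a single Euclidean block; treating a block of an RKHS component as a Euclidean vector, as well as the function name, are assumptions made for illustration and are not part of the analysis in this paper.

```python
import numpy as np

def prox_block_elastic_net(v, lam1, lam2):
    """Minimizer of 0.5*||u - v||^2 + lam1*||u|| + lam2*||u||^2 over u.

    The block is set exactly to zero when ||v|| <= lam1 (block soft-thresholding),
    and is otherwise shrunk by the factor (1 - lam1/||v||) / (1 + 2*lam2).
    """
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm <= lam1:
        return np.zeros_like(v)          # whole block removed: exact sparsity
    return (1.0 - lam1 / norm) / (1.0 + 2.0 * lam2) * v
```

With `lam1 > 0`, blocks of small norm are mapped exactly to zero, whereas with `lam1 = 0` (a pure squared-norm penalty) every non-zero block is only rescaled and stays non-zero.
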

A Proofs of the theorems 

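For a function $f$ on $\mathcal{X} \times \mathbb{R}$, we define $P_n f := \frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i)$ and $P f := E_{X,Y}[f(X, Y)]$. For a function
$f_I \in \mathcal{H}_I$, we define $\|f_I\|_{\ell_1}$ as $\|f_I\|_{\ell_1} := \sum_{m \in I} \|f_m\|_{\mathcal{H}_m}$, and for $f \in \mathcal{H}$ we write $\|f\|_{\ell_1} := \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}$.
Similarly, we define $\|f_I\|_{\ell_2}$ as $\|f_I\|_{\ell_2}^2 := \sum_{m \in I} \|f_m\|_{\mathcal{H}_m}^2$ for $f_I \in \mathcal{H}_I$, and for $f \in \mathcal{H}$ we write
$\|f\|_{\ell_2}^2 := \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^2$. We write $\max\{a, b\}$ as $a \vee b$.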

$^6$ Note that in the original paper by Bach (2008), the RHS of (17) is $\sum_{m \in I_0} \|f_m^*\|_{\mathcal{H}_m}$ because the squared group-$\ell_1$
regularizer $(\sum_m \|f_m\|_{\mathcal{H}_m})^2$ was used. We can show that the squared formulation is actually equivalent to the
non-squared formulation in the sense that there exists a one-to-one correspondence between the two formulations.
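Lemma 6 For all $I \subseteq \{1, \dots, M\}$, we have

$$\Bigl\|\sum_{m=1}^{M} f_m\Bigr\|_{L_2(\Pi)}^2 \ge \bigl(1 - \rho(I)^2\bigr)\kappa(I)\Bigl(\sum_{m \in I} \|f_m\|_{L_2(\Pi)}^2\Bigr). \tag{18}$$

Proof: For $J = I^c$, we have

$$P f^2 = \|f_I\|_{L_2(\Pi)}^2 + 2\langle f_I, f_J\rangle_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2 \ge \|f_I\|_{L_2(\Pi)}^2 - 2\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2$$
$$\ge \bigl(1 - \rho(I)^2\bigr)\|f_I\|_{L_2(\Pi)}^2 \ge \bigl(1 - \rho(I)^2\bigr)\kappa(I)\Bigl(\sum_{m \in I} \|f_m\|_{L_2(\Pi)}^2\Bigr), \tag{19}$$

where we used Schwarz's inequality in the last line.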
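
The middle inequality in the proof of Lemma 6 is a completion of the square. As a sketch, writing $a = \|f_I\|_{L_2(\Pi)}$, $b = \|f_J\|_{L_2(\Pi)}$ and $\rho = \rho(I)$ (shorthand introduced here for illustration):

```latex
\begin{align*}
a^2 - 2\rho\, a b + b^2
  &= (b - \rho a)^2 + (1 - \rho^2)\, a^2 \\
  &\ge (1 - \rho^2)\, a^2 .
\end{align*}
```

The first inequality in the proof uses $\langle f_I, f_J\rangle_{L_2(\Pi)} \ge -\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)}$, which follows from the definition of $\rho(I)$ applied to $f_I$ and $-f_J$, and the final step is the definition of $\kappa(I)$.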



The following lemma gives an upper bound of $\sum_{m=1}^{M} \|\hat{f}_m\|_{\mathcal{H}_m}$ that holds with a high probability. This is
an extension of Theorem 1 of Koltchinskii and Yuan (2008). The proof is given in Appendix B.



Lemma 7 There exists a constant F depending on only L in (Al) such that, if \i > Fy — -, we have, 
for r — ^<n)^^(n) . with probability 1 — n , 



M / M M 

lrn\\n„ 

m—1 



Moreover, if\'^'> > Fd'^^^^and A^"^ > X('>, we have 



M 

E ll/™-/mll«„ <M(3/2 + 2max||/;;||«„ 



m—1 



The following lemma gives a basic inequality that is a starting point for the following analyses. The proof is
given in Appendix B.



Lemma 8 Suppose A^""* V A2"'' > p^^SSLMul where F is the constant appeared in Lemma^ Then there 

exist constants Ki and K2 depending only on L in (Al), R in (A4), s in (A6), C2 in (A6) such that for all 
I C {1, . . . , M}, and for all t > log \og{R^/n) + log M, with probability at least 1 — e~* — n~^, 

Iwf - /iiLcn) + E 11/^ - fiWk. + 4"^ E ii/™ii«.. + (^"^ - - ^2/1) E ii/™ii«™ 

<i^i(i + 11/ - )( E " " ^' ll/*" " /"'ll^- ' ^il/-/*lki 



V _^ 

mei '^'+° 



+E fA("'^^+2A(")|lg;j|«„.+ K,.[i] ||/,„ - /,;|U,(n) 
me/ V »Jrnm„, V nj 

+ A^"^ E II/™IIh™+ U^^+'fn + E ll/™ll«- (20) 

where J = r, 7„ := ^ a«(i7„ := 7„(1 + ||/ - /*l|oo)- 

The above lemma is derived by the peeling device or localization method. Details of those techniques can be
found in, for example, Bartlett et al. (2005), Koltchinskii (2006), Mendelson (2002), and van de Geer (2000).



Proof: (Theorem [D Since A^"^ > j^J '°^^^f "\ we can assume that the inequality (l20l l is satisfied with 

I ~ Iq. For notational simplicity, we suppose / denotes /g in this proof. In addition, since 

ll/lloo < < 3i? (with probability 1 - n-^) by Lemmal?] Note that ||/*J|«„ = for all 
m e J = r = IS, and % + A'a^^ < m&x{Kn~^ + y^, ^ ^^si^ } = A^"^ by taking K
sufficiently large. Therefore by the inequality (EOt . we have 

Wfm ^ fmW L2(n)ll-^'" ^ ■fm\\v.,-n t 



j!i/-riiL(n) + Ar'iiA-/;iii<A'i(E 



I!/, 



where i^i is + 3R) (here we omitted the term J2mei ^ /mll'H,„ foi" simplicity. One can 

show that that term is negligible). 

By Holder's inequality, the first term in the RHS of the above inequality can be bounded as 

^ /mllL2(n)ll'^™ ^ . „ (Z^me/ ^ /mllL2(n))^"''(ll// - // Ik)" 

^1 2. -c: < 



< VdK •^™llL(n)) ^ (llZ-f //life)' 



Applying Young's inequality, the last term can be bounded by 

;E II/™ - /^iiLm)"^ X (A^"V2)*(ii// - /;iil)^ 



\\Jr,i - Jrn\\L2(n)) " ^ 1^2 

2 

2 



<C(n-i^/dAr^'')^ Y: \\.L /;jlL(n) + ^11// - /z* 



f*||2 



\m6/ / 

<c[(i - p2(/)).(/)]-^^-idAr^^^ + ' ^'^^^^"^^^ E II/™ - /™iiL(n) + 4^11// - ml 

— 1 x^"^ 
<Cn-idA(")"' + -11/ - /*lli^(n) + ^11// - nWl- (22) 

where $C$ denotes a constant that is independent of $d$ and $n$ and may change from line to line, and we used Lemma
6 in the last line. Similarly, by the inequality of arithmetic and geometric means, we obtain a bound as



E 2 U 

mel \ 



(n) llffmll^An , II , £' / ^ I II f -f* I 

llj^ + ^2 llffmll-H™ + J^'2\l - I ll./m - /,„||L2(n) 

2 



<c[(i - ADHDr E I (^jfe:) ^i"'' + \\9*r.rnA"'' + 
{i-p^i))4i) 

~f g 2^ WJni JmllL2(n) 



12 _ 1 



<C(dAi"^ +rfV«) + ^ll/-/1lL(n), (23) 



where we used Lemma 6 in the last line. By substituting (22) and (23) into (21), we have

ill/ - /1lL(n) < C f rfn-Ar^-^ + dA^' + A^^^ + ^) . (24) 



The minimum of the RHS with respect to A^"^ , A2"^ under the constraint Aj"'' > Aj"^ is achieved by 



A^"^ = mSix{Kn~^ + K2\JJ,, f \/ '°^^,f }, A^"^ = Kn~^ up to constants. Thus we have the first 
assertion (7).






Next we show the second assertion (8). By Hölder's inequality and Young's inequality, we have

II/™ - /mlli^fn)!!/'" ^ /mllw™ ^ (Eme/ ll/m ~ /ml|L2(n))^~'*(||// - // ll^l)'' 

^1 2- zm ^ ^1 



me/ 

< CA-T^n-3n^ Erne/ ll/m - /;j|L.(n) + |ll// - //lU, 

< CdA-i^n-T^ + ^11/- r||i,(n) + |(ll//||£, + ll/;ilfj, (25) 
where $\lambda > 0$ is an arbitrary positive real. By substituting (25) and (23) into (21), we have

^11/ - r lli.,n) < C{d\-^^n-^^ + A + dAi")^ + A^"^^ + ^ 



ThisisminimizedbyA = Cd^n-TTJ,A^"^ = ( 2j^i(i+3fl) +K./ljyFJ]2s{M2}l > (27„ + X2J^)V 



F^J ^°^^^ —, and Aj""* < A^^"''. Thus we obtain the assertion. ■ 

Proof: (Theorem 2) Let $I_d := \{1, \dots, d\}$ and $J_d := I_d^c = \{d+1, \dots, M\}$. By the assumption (A7), we
have ZrneJ, ll/™ll«,„ < W^id'~'^ EmeJ, ll/mll«,. < Therefore LemmaEgives 

11/ - r liLm) + Ar'iiA. - /;jil + Ar^ii./,7ji.^, 



(n) 

ll/ni ^ /mil L2(n)ll/™ ^ /mllwm i||/ — ./*||£i 



me/jj 



, /v^ll? ^* II \ / 'l-^™ /mllL2(n)ll/'" /rnllw™ , t\\f-f*\\i^ 

+ K^\^\\U-U\n,„ ( ^ + 

\m = l / me/d ^ 



meld \ 



C I A(")di-2^ + ( a(") + 7„ + \/- d^-^^ , (26) 



if A^"'' > 7„ + Y ^ and A^"'' > Fy °^^^^^ . The second term can be upper bounded as 

( f \\L-m\u,] ( E , ^ii/-/ii..- 



H^ev ^ (Emg/JI/m-/mlU.(n))^-^(Emg/JI/™~/mll^„r . t||/-ri|,, 



\ ( 1^ \\f .„,^l-sf 

(Lme/d II/™ ~ /mlli2(n))^"'' (Em=l ll/m ^ /mll«™) (Eme/d H/™ ~ fniWu,^)" t\\j - f*\\j^ 



\7rL—l 



y n ?7. 

Jensen ^'^'(Eme/, II/™ " /m 1 1 L (n) ) A//^ = i ||/m " /mlll.„) ' (Eme/, II/- f^^MiJ^ 
< Kl ;= 

, t\\f-r\\i 

n 

Len,ma|6] { (1 - M/rf)}- ( 1 1 / - /1| ) 1 1 / - /III+' ^11/- /II?, 

< A 1 -= 1 

n 

^T^ II/-/1lL(n) ^ ^, {{1^ p{m<m-^drhMrh\\f - f*\\l t\\f - r\\l 
S r 1 1 

'< "'^-^"--"' + c ^"^"* ii/ - rill . 
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b{l-s) + l 1^ 

We will see that we may assume 

(j d 1+^ ^ MTT? ^ j^^^ ^j^g second term in the RHS of the above 



inequality can be upper bounded as 



g "° " n/-rii.^. < ^11/ -nil < V (iiA. + 2ii/,7jii + 2||/}ji,2. 



< 



A 



(ii//.-/;jil + ii//jil + ii/}jil)- (27) 



Moreover, Lemma 7 gives



< CX: > and 



< 



becomes 



< Ci?A^"^ . Therefore 



1 



* l|2 

d iidWt-i 



An) 



2"' - ''^^m ■ 2 

~ \ ^ a/? 



-!l/jJ 



^(ciA1")+2A 



+ C A(")di-2^ + a1") + 7„ + ,/ a di-'' 



As in the proof of Theorem [T] (using the relations (|2?t and (l22t). we have 

1, 



2 11/ ~ /*lli2(n) 
<c\\{\-p\U))n{U))Y^ 



+ A^"^di-2^ + (a1") + 7„ + + <A: 



2 



Now using the assumption (1 — p^[ld))K[Id) > Cid ^, we have 

n2 ,„>2 



ll//d - fid 



12 <C 



i2(n) 



V n n 



(28) 



Remind that % = A'i(l + \\f - /*!|oo)/\/^- Since A^^"^ > F 



log(J\/n) 



Lemma|7]gives ||/ - /*| 



< 



$\sqrt{M}\,3R + R \le c\sqrt{M}$ with probability $1 - n^{-1}$ for some constant $c > 0$. Therefore $\tilde\gamma_n \le c\sqrt{M/n}$. The
values of $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$ presented in the statement are achieved by minimizing the RHS of Eq. (28) under the



constraint A^"' > c^jM/n + K^J ^ > 7„ K^J ^ and C 



6(1-3) + ! 1 



< 



i) Suppose n (2ti+b)(2+s)-i-s > c^J M /n, i.e., r < T2. Then the RHS of the above inequality can be 

1 2)3 + 6-1 („\ ^ i>+3/3-l 

minimized by d = n (^f+^xa+^j-i-^ , ' = Kn (2,s+b)(2+=)-i-= ^ and A^ = max{i^n (2,s+b)(2+.)-i-= ^ 
^'2\f^T^\f^^^^^} to constants independent of n, where the leading terms are d^^^n^^X^^^ + 
d''A^"'V A^"^di-2'^ -I- A*"^(ii-'3. It should be noted that A^"' is greater than 7„ + /^2a/^ because 



(2^f+6)(2+3)-i-3 > c^Mjn > 7„, therefore (126b is valid. Using t < T2, we can show that 
Cd r+= (M/n)T+= < A2"''/4 by setting the constant K sufficiently large, hence (|27] | is valid. Moreover, 
since M > nW+w+^F^T^ = n^^, we can take d as d = n(2/3+b)(2+»)-i-» < 
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ii) Suppose T2 < r < T3. Then the RHS of the above inequality can be minimized 

by d = (M2+^n2-'^)2{(2+=)(b+«-.}, A^"^ = i^(M„-{2(&+/3)-i})ir(T2+7)(i;TF^T)W, and A^"^ = 

max I^c^^mJu + A'2y^, j- > 7„ + K2^J~^up to constants independent of n, where the lead- 

ing terms are di+''n-iA^"^"'+ d''X^^'^\ x']"U^~'^. Since A^"' > % + & is valid. Using r < rg. 



we can show that Cd 1+= {M/n) 1+= < A2 /4 by setting the constant K sufficiently large, hence ( |27] i is 
valid. Moreover, since 

R < £^ and T, < T, we can show that d < M. 

' — 2(l-s) — 

iii) Suppose T3 < r < t^. We take A^"^ = max i^c/W/^ + k2^^, F^f^^^'^. Then the RHS of 

the inequality ( |28T l is minimized by Aj""* = K\fd\^^^ ^ ^ dMjn and d ~ (ij) ^"'+''' up to constants, where 
the leading terms are <i''A^"^^ + di+''Aj"' V \^^'^d^~<^. Note that since A^"^ > 7„ + ^'2^^, (Il6]l is vaUd. 

Using T < r4, we can show that Cd 1+= {M/n) 1+= < Aj /4 by setting the constant K sufficiently large, 
hence (Elli is vaUd. Moreover, since /3 < and n^'a < A/, we have d = {fj)"^^^ < M. 

In all settings i) to iii), we can show that > Thus the terms regarding t is upper bounded as 

+ + ^^2"^^ < + ^2"^^)(V^ + Through a simple calculation is evaluated as 

i) ~ ^- iT^^+bKli^Ki-^f , ii) ~ (7V/(2+'')(i-/5)n~(''^+2^+''''~2))2{(p'+i.)("2+=)-»} ^ and iii) ~ 
(Af^"^?!"'^^'^^'') respectively. Thus we obtain the assertion. ■ 

Proof: (Theorem 3)

(Convergence rate of block-$\ell_1$ MKL)

Note that since A^"^ > A2"^ = 0, we have = 1. Therefore Lemma |7] gives X]m=i ll/m||'H„i ^ 

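$3R$ with probability $1 - n^{-1}$. Thus $\tilde\gamma_n = \gamma_n(1 + \|\hat f - f^*\|_\infty) \le \gamma_n\bigl(1 + \sum_{m=1}^M \|\hat f_m\|_{\mathcal{H}_m} + \sum_{m=1}^M \|f^*_m\|_{\mathcal{H}_m}\bigr) \le \gamma_n(1 + 4R)$.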

When A*"^ = and A^"^ > (1 + 4i?)7„ + K2 \ —, as in Lemma [8] we have with probability at least 

1 - e-* - n-^ 



ii/-riiL(n)+Al"'Eii/'"ii«" 



7n^I 
1 1-s 



+ K2Y.\ll\\rm- fraU^iU), (29) 

for all $t > \log\log(R\sqrt{n}) + \log M$.

We lower bound the term A^"''X]me/(ll/™ll'Wm ^ ll/mll'Hm) i" the LHS of the above inequality (|2TI) . 
There exists $c_1 > 0$ depending only on $R$ such that



2 



> cill/™ - - 2i|/,;||^M(/™ - + (30) 



1/2 

for all /rn € "Hm such that < and m G Iq. Remind that = Tm g*^, then we have 

> ci|i/,„ - - 2giJ^||/,„ - /,;|U,(n) + ll/;„lk„. Since max„, < 3R are 

met with probability 1 — n~^, 

\\frn\\-H„, > Cl|l/„i - f^W-H,^ - 2--j^——^\\f,n - .f*n\\L2{n) + ll/mll«„., 

with probability $1 - n^{-1}$.
Therefore, by the inequality (29), we have with probability at least $1 - e^{-t}$



ii/-riiL(n)+Ai"'B^iii/™-/,;iil,„-2. 



I.9mll«„ 



/m - / *„ 1 1 L 2 ( n ) +1 1 / *„ 1 1 « ™ ) 



me I 

Wfm ^ /mil L2(n)ll/™ ^ fmWT-lr, 



/?n 1 1 



me I 



«„ + 2a(")5]||/;;||«„ 



■me I 



me J 



mel 



rn||L2(n), 



for all $t > \log\log(R\sqrt{n}) + \log M$. Thus, using Young's inequality,



(31) 



ll/-rili2(n) <c 



The RHS is minimized by d = and A^"^ = max <! Kn + K2J 



log{Mn) 



(up to 



constants independent of n). Note that since the optimal A^"-* obtained above satisfies A^"'' > (1 + 4i?)7„ + 
K2 by taking K sufficiently large, the inequality ( ISTT i is valid. Moreover the condition M > ti^-- = 
Yi(fi+bnb(2+s)+2} jjj fjjg Statement ensures d < M. Finally we evaluate the terms including t, that is, -^d + 
—d^~^. We can check that —d^^^ < \ —d^"^. Therefore those terms are upper bounded as -6?^+'' + 



7,'^"^ < V + t) ~ n 2(2+.)(b+f)) _^ Thus we obtain the assertion. 

(Convergence rate for block-$\ell_2$ MKL)

When $\lambda_1^{(n)} = 0$, substituting $I_M$ for $I$ in Lemma 8 and using Young's inequality as in the proof of
Theorem 2, the convergence rate of block-$\ell_2$ MKL can be evaluated as



ll/zd - //JlL2(n) < 



,f6\(«)' 



tx 



+ ! A/1+6 



(32) 



with probability 1 

v(") 



n ^ (note that since / = {1, . . . ,M} (/^ = 0), we don't need the condition 

gives the minimum of the RHS with respect to 



XT > % + K^^i). a(") = if(f )^ V F 

X^^'^ up to constants. Using r < rg, we can show that M r 
setting the constant K sufficiently large, hence d27T i is valid. 



log(A/Ti) 



-{M/n)— = M 1+= n-— < Xf' by 



B Proof of Lemmas 7 and 8

Proof: (Lenuna|7ll Since / minimizes the empirical risk ([U, we have 



n / M \ ■ 

i=l \m=l / 



+ a<"^II/II.,+a("^II/II,^, 



M n 



^-EE^«(/"(^'0-/m(^.)) + Ai"'iirii., + A("'iirii^^. os) 

m— 1 i—1 

By Proposition 1 (Bernstein's inequality in Hilbert spaces; see also Theorem 6.14 of Steinwart (2008), for
example), there exists a universal constant $C$ such that we have



~ / J ^iifmi^i) ~ fmi-'^i)) — ~ / , ^i^mi^i: ') 
rj — ^ Ti ^ — ^ 



/m /mll'H„ 



<cL,/MM!^ii/,„ - II,.. < cL,/M^(ii/;.ii«,„ + ii/:ji« J 



(34) 
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for all m with probability at least 1 — n ^, where we used the assumption < i jf ^(") > 



4CL^i2S(^, then we have 

Al"'ii/ik. + An/Ill < 3(Ar V xrmnu. + ii/iii), os) 

with probability at least 1 — n . Set r — — j-^, then by Young's inequality and Jensen's inequality, the 

V A2 

LHS of the above inequality ( |33] ) is lower bounded by 

M 
ni—l 

>M(Al")vAr^)('^^||/„||^-;:) 

\ m=l / 

>M'^-i(Ai"'vA("))||/||,2-^ (36) 

Therefore we have the first assertion by setting $F = 4CL$.

The second assertion can be shown as follows: by the inequality (33), we have

M-'x^-^[\\f-r\u,y <x^-^\\f^n\l 

r. M n M 
m— 1 2—1 m— 1 



<a(") ( | + 2nmx||/;„||«,„ ) ||/-r||,, (37) 



3 



with probabihty at least 1 - n-\ where we used (O, A^"^ > iCL^^-^SiMlH and A^"^ > A^"^ in the last 
inequality. ■ 

Proof: (LemmaH) In what follows, we assume j|/ — f*\\ei < R where R = AMR (the probability of this 
event is greater than 1 — by Lemma|7ll. Since / minimizes the empirical risk we have 

PM Yf + a(")||/||,, + a(")||/||2^ < P„(/* - Yf + Ai")||/*||,, + \t\\n\l 

=^ p{f - rf + x^'\\f.,\w + < (p-p„)((.r -/? + 2(/ -/*)£)+ 

+ ^[''\\\mir~\\fl\\ir) + ^^^\\\ml-\\fl\\V + ^^l^^ (38) 

The second term in the RHS of the above inequality (38) can be bounded from above as

{\\fn\i^-\\fI\\i^)<J2(^\\frn\\n„.^frr^-rJn,„ 
me I 

E{9miTnl ifm — fm))'H„^ ^ V"^ 1 1 .9 m 1 1 "Hm 1 1 ? ^*|| incw 
FFC II-'" " /™lli2(n), (39) 

where we used $f^*_m = T_m^{1/2} g^*_m$ for $m \in I \subseteq I_0$. We also have



me/ 

< xi"\Y. ngmW-uju - /:jiL.(n) - wh mi). m 



me I 

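Substituting (39) and (40) into (38), we obtain

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 + \lambda_2^{(n)}\|\hat f_I - f^*_I\|_{\ell_2}^2 + \lambda_1^{(n)}\|\hat f_J\|_{\ell_1} + \lambda_2^{(n)}\|\hat f_J\|_{\ell_2}^2$$
$$\le (P - P_n)\bigl((f^* - \hat f)^2 + 2(\hat f - f^*)\epsilon\bigr) + \sum_{m \in I}\Bigl(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}\Bigr)\|\hat f_m - f^*_m\|_{L_2(\Pi)}$$
$$+ \lambda_1^{(n)}\|f^*_J\|_{\ell_1} + \lambda_2^{(n)}\|f^*_J\|_{\ell_2}^2. \qquad (41)$$

Finally we evaluate the first term $(P - P_n)\bigl((f^* - \hat f)^2 + 2(\hat f - f^*)\epsilon\bigr)$ in the RHS of the above inequality (41)
by applying Talagrand's concentration inequality (Talagrand, 1996a,b; Bousquet, 2002). First we decompose
$(P - P_n)\bigl((f^* - \hat f)^2 + 2(\hat f - f^*)\epsilon\bigr)$ as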



M 



m— 1 

and bound each term (P — Pn)((/* — /)(/*„ ^ /m) + 2(/m — /m)^) the summation. Here suppose f E Ti 
satisfies ||/||oo < WfWd < P for a constant P (< P). Since |e| < L, we have 

|//rn + 2/™e| < 2(L + P)|/| < 2(L + P)||/„||«„, (42a) 



< ll/llL2(n)ll/m||L2(n) + 2P||/m||L2(n), (42b) 

for all f eH. Let (5„/ i ^ifi^i: Vi) where {ei}"^i S {±1}" is the Rademacher random variable, 
and *™(^ Cm) be 

*m(Cm,Cr™) := E[sup{Q„(|/™|) I /m e "Hm, < $m , |1 /m |1 L2 (H) <Crm}]- 

Then one can show that by the spectral assumptions (A5) (equivalently the covering number condition) 

*m(e™, Cm) < (^kl^ V n-i^e, 



where $K_s$ is a constant that depends on $s$ and $C_2$ (Mendelson, 2002). Let $\Xi_m(\xi_m, \sigma_m) := \{ f_m \in \mathcal{H}_m \mid
\|f_m\|_{\mathcal{H}_m} \le \xi_m,\ \|f_m\|_{L_2(\Pi)} \le \sigma_m \}$. Now, by the Rademacher contraction inequality (Ledoux and Talagrand,
1991, Theorem 4.12), for given $\{\xi_m, \sigma_m\}$ and $\bar R$ we have

E[sup{Q„(//m + 2/,„e) I / e -H such that /,„ e S.^i^ 

), \\f\W<R}] 

<2{L + P)*,„(Crn, Cm) < 2A',(i + P) (ikl^l y „-T^Cm) ■ (43) 



Therefore, by the symmetrization argument (van der Vaart and Wellner, 1996), we have

E[sup{(P„-P)(//m + 2/me) I / e H such that /m e Sm(^m,Cm), ||/||,, < P}] 

<4K,(L + P) V n-TT7^„^ . (44) 



By Talagrand's concentration inequahty with (l42t and (|44] |. for given P, ct, ^m, Cm with probability at 
least 1 — e^* {t > 0), we have 



sup (P«-P)(//m + 2/me) < 

ll/lli,2(n)<*.ll/ll~<«./'"e=™(«™'<"") 

V2 (^AK,{L + P) (^^^^ V + y^(cam + 2Pam) + 2(L + P)^™^) ■ (45) 

where we used the relation (42). Our next goal is to derive a uniform version of the above inequality over

i^^'^^' 7^^^™^^ ^"'^ 7^^^'"^^- 

By considering a grid {P('^i\ c^'^^^ C^^^ c^''^})°f4jS'^ 4) ^uch that pC^) := P2-^ ^C^) P2-^ 

:= P2-'= and ailt'> := R2-'', we have with probabiHty at least 1 - (log(A//PV^))''e-* > 1 - 
(log(4PM2v^))4e-* 

iPn - P)(//m + 2/m.) <K {I + ( " " ^ " V + tjf^] 



j2t 

+V — (ll/IU2(n)||/m|lL2(n) +2P||/m|lL2(n)), 
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forall/ e -Hsuchthat < i?and||/||£, < i?, and for all i > 1, where if iiiK^LV AKsV2LV2). 

Summing up this bound for $m = 1, \dots, M$, we obtain



V n" i "-^"^^("^ ^ ll/mlUam + 2i X! ll/™IU2(n) 1 

\ m— 1 m— 1 / 



uniformly for all / G 7^ such that < R (Vm) and < R with probability at least 1 — 

M(log(4i?il/2V7I))4e-*. Here set 7„ = ^ and note that ^||/||^^(n) Ei=i ll/m|lL.(n) < ill/llL(n) + 
KE™=ill/™llL.(n))^<ill/llL(n) + Kll/ll^.)' then we have 



(P„-P)(r + 2/e) <X(1 + ||/|UJ 



me I 



M 



1 ^' 
.(1 + ll/il^JII/^IU. + 2 ll/llL(n) + 2v^iV n ^ U/'-H^^m- ^^6) 

m— 1 

for all / e "H such that < R (ym) and \\f\\ei < R with probabiHty at least 1 - 

$M(\log(4\bar R M^2\sqrt{n}))^4 e^{-t}$. We will replace $t$ with $t + 5\log M + 4\log\log(R\sqrt{n})$; then the probability
$1 - M(\log(4\bar R\sqrt{n}M^2))^4 e^{-t}$ can be replaced with $1 - e^{-t}$, and we have $t + 5\log M + 4\log\log(R\sqrt{n}) \le 6t$

for all $t \ge \log M + \log\log(R\sqrt{n})$. On the event where $\|\hat f - f^*\|_{\ell_1} \le \bar R$ holds, substituting $\hat f - f^*$ for $f$ in
(46) and replacing $K$ appropriately, (41) yields

^11/ - rwlm + ^2"' E iiA - + ^2"^ E + (^1"' - E 

<-^^i(l + J - ./ \Ui)[ >, ^ V 1 \ 

+E (Ar^t^+2Ar'ii.9:ji«„.)ii/™-/;ju.(n)+Ar' E ii/;ji«„+(Ai"'+7n) E 

n- M 

+ ^2^-E ll/™-/™ll^2(n), (47) 

m— 1 



where Ki and 7^2 are constants and % = 7„(1 + WfWe^)- Finally since ^ J2m=i II/™ ^ /mlU2(n) = 

-^2y|-(Z]me/ ll/ni ~ /mllL2(n) + SmG.7 1 1 /™ 1 1 ^2 (H) + Sme.7 1 1 /)n 1 1 L2 (H) ) < 2 y^(I]me/ 1 1 ~ 
ll/m||w„. + Erne, 7 WfrnWuJ, (113 bcCOmeS 

^11/ - /1li.(n) + A^' E 11/^ - fiWk. + A^"' E ll/™ll«^ + f^l"^ - ^" - ^2^) E ll/™ll«. 

<Ki{l + II / - /*|lf )( E ^ -^'""^^mll/™ " /mll«„, ll/m - /m||«„ , - /lUi ' 



mel 



ni+s 



+E fA("^^^+2Ar^|lg;;|l«„.+ A-2^/1') ||/,„ - /;j|,,(n) 

+ A^"^ E II.CII«. + (^1"'+^" + E ll/™ll«- (48) 



which yields the assertion. 
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C Proof of Theorems i and g] 

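We write the operator norm of $S_{i,j} : \mathcal{H}_j \to \mathcal{H}_i$ as

$$\|S_{i,j}\|_{\mathcal{H}_i, \mathcal{H}_j} := \sup_{g_j \in \mathcal{H}_j,\ g_j \neq 0} \frac{\|S_{i,j} g_j\|_{\mathcal{H}_i}}{\|g_j\|_{\mathcal{H}_j}}.$$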

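Definition 9 For all $1 \le m, m' \le M$, we define the empirical (non-centered) cross covariance operator
$\hat\Sigma_{m,m'}$ as follows:

$$\langle f_m, \hat\Sigma_{m,m'} g_{m'} \rangle_{\mathcal{H}_m} := \frac{1}{n}\sum_{i=1}^n f_m(x_i)\, g_{m'}(x_i), \qquad (49)$$

where $f_m \in \mathcal{H}_m$, $g_{m'} \in \mathcal{H}_{m'}$. Analogous to the joint covariance operator $\Sigma$, we define the joint empirical
cross covariance operator $\hat\Sigma : \mathcal{H} \to \mathcal{H}$ as $(\hat\Sigma h)_m = \sum_{l=1}^M \hat\Sigma_{m,l} h_l$. We denote by $\hat\Sigma_{m,\epsilon}$ the element of $\mathcal{H}_m$
such that

$$\langle f_m, \hat\Sigma_{m,\epsilon} \rangle_{\mathcal{H}_m} = \frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i).$$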
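As an illustration of Definition 9 (not part of the original proof), the following Python sketch evaluates the bilinear form $\langle f_m, \hat\Sigma_{m,m'} g_{m'}\rangle_{\mathcal{H}_m}$ when $f_m$ and $g_{m'}$ are finite kernel expansions over the sample points, in which case the form reduces to $\frac{1}{n}(K_m a)^\top (K_{m'} b)$ with Gram matrices $K_m$, $K_{m'}$. The Gaussian kernels, the sample points, and every function name below are assumptions made only for this example.

```python
import math


def gaussian_kernel(width):
    """Return a Gaussian kernel k(x, y) on the real line (illustrative choice)."""
    def k(x, y):
        return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))
    return k


def gram(kernel, xs):
    """Gram matrix K[i][j] = k(x_i, x_j)."""
    return [[kernel(a, b) for b in xs] for a in xs]


def evaluate(kernel, coeffs, centers, x):
    """Evaluate f(x) = sum_j coeffs[j] * k(x, centers[j])."""
    return sum(c * kernel(x, z) for c, z in zip(coeffs, centers))


def cross_cov_form(K_m, K_mp, a, b):
    """<f, Sigma_hat_{m,m'} g> = (1/n) * sum_i f(x_i) g(x_i),
    with f = sum_j a_j k_m(., x_j) and g = sum_j b_j k_m'(., x_j)."""
    n = len(a)
    f_vals = [sum(K_m[i][j] * a[j] for j in range(n)) for i in range(n)]
    g_vals = [sum(K_mp[i][j] * b[j] for j in range(n)) for i in range(n)]
    return sum(fv * gv for fv, gv in zip(f_vals, g_vals)) / n
```

The Gram-matrix form avoids evaluating the functions pointwise, which is how such quantities are typically computed in practice.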



Let R be a constant such that MEm^i WfLWn,^ + E™=i WfLWnJ < R- We denote by F,, the objective 
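function of elastic-net MKL:

$$F_n(f) := \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda_1^{(n)}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^2.$$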
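For concreteness, the elastic-net MKL objective (squared loss plus a block-$\ell_1$ term weighted by $\lambda_1^{(n)}$ and a block-$\ell_2$ term weighted by $\lambda_2^{(n)}$) can be evaluated numerically once each component is written as a kernel expansion $f_m = \sum_j \alpha_{m,j} k_m(\cdot, x_j)$, so that $\|f_m\|_{\mathcal{H}_m}^2 = \alpha_m^\top K_m \alpha_m$. The sketch below is a minimal illustration under that parametrization; the function names and the coefficient representation are assumptions for this example, not the authors' implementation.

```python
import math


def _matvec(K, v):
    """Plain matrix-vector product for a list-of-lists matrix."""
    return [sum(k_ij * v_j for k_ij, v_j in zip(row, v)) for row in K]


def block_norms(grams, alphas):
    """RKHS norms ||f_m|| = sqrt(alpha_m^T K_m alpha_m) for each kernel m."""
    norms = []
    for K, alpha in zip(grams, alphas):
        quad = sum(a * v for a, v in zip(alpha, _matvec(K, alpha)))
        norms.append(math.sqrt(max(quad, 0.0)))  # guard against round-off
    return norms


def elastic_net_mkl_objective(grams, alphas, y, lam1, lam2):
    """(1/n) sum_i (f(x_i) - y_i)^2 + lam1 * sum_m ||f_m|| + lam2 * sum_m ||f_m||^2,
    where f(x_i) = sum_m (K_m alpha_m)_i."""
    n = len(y)
    preds = [0.0] * n
    for K, alpha in zip(grams, alphas):
        for i, v in enumerate(_matvec(K, alpha)):
            preds[i] += v
    loss = sum((p - t) ** 2 for p, t in zip(preds, y)) / n
    norms = block_norms(grams, alphas)
    return loss + lam1 * sum(norms) + lam2 * sum(r * r for r in norms)
```

Setting `lam2 = 0` recovers the block-$\ell_1$ objective and `lam1 = 0` the block-$\ell_2$ objective.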

Proof: (TlieoremS) Let / G ®m<zi„T-Lm be the minimizer of f'n: 
/ := argniinF„(/), 

where i?„(/) - ^(/(^,) " y.)' + A^"^ ^ + A*"^ ^ 

(Step 1) We first show that $\tilde f \xrightarrow{p} f^*$ with respect to the RKHS norm. Since $\lambda_1^{(n)}\sqrt{n} \to \infty$, as in the
proof of Lemma 7, the probability of $\sum_{m=1}^M \|\tilde f_m - f^*_m\|_{\mathcal{H}_m} \le \sqrt{M}R$ goes to 1 (this can be checked as



follows: by replacing y^ '°g(^-^") gq ( [34] | with log(A/) Aj"'' , then we see that Eq. ( l34b holds with probability 
1 — exp(— A^^"'' n)). There exists ci only depending \/MR such that 



2 



Since / minimizes F„, if X]m=i 11/™ ^ /ml I'M™ — VMR (the probability of which event goes to 1) we 



||/m||«„, — \J\\fm - frnWiim ^ ^^/™ ^ /*" ' f*n)n„^ + \\frn\ 

>ci\\fm - - 2||/:j|«tJ(/™ - fmJmUJ + (50) 

for all m G Iq and all /,„ e "Hm such that H/mll-Hm < \/MR 
S 

have 

(4 - fi,±io,ioifio - /;o))«.o + ciA^"^ E 11/" - /™ii«.. + -^2"' E II/" - /™ii«™ 

mG/o mG/o 

<2(S,„„/- + 2 ^ ("^—At") + a(")) |(/™ - /4, I, (51) 

^j^\\\U\n„. J 

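where we used the relation (50). By the assumption $f^*_m = \Sigma_{m,m}^{1/2} g^*_m$, we have $|\langle \tilde f_m - f^*_m, f^*_m \rangle_{\mathcal{H}_m}| \le
\|g^*_m\|_{\mathcal{H}_m}\|\tilde f_m - f^*_m\|_{L_2(\Pi)}$. By Lemma 10 and Lemma 11, we have

$$\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} = O_p(1/\sqrt{n}), \qquad \|\hat\Sigma_{I_0,\epsilon}\|_{\mathcal{H}_{I_0}} = O_p(1/\sqrt{n}).$$

Substituting these inequalities into (51), we have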

ii/-/iiL(n) + '^iAi"' E ii/™-/;nii?.,„ + Ar^ E ii/m-/;„ii?,„ 



(n) 



(52)
Recall that the (non-centered) cross correlation operator is invertible. Thus there exists a constant $c$ such
that

11/ - /1lL(n) = (//o - /;„,S/oJo(//o - /;„))« = (//o - /;„,Diag(I]^/2„JV7o,/oDiag(I]i/2„J(/,„ - 

>C ^ {fm - fmi ^m,mifm " /m))«,„ ^ X! H^™ ^ /" 



/"m 1 1 Lain) ■ 



This and Eq. (52) give, using $ab \le (a^2 + b^2)/2$,

ii/-riiL(n) + ciAl") ^ ii/m-/;„iik + 4"^ E 

<• n f SmgJo II/'" ~ /mll?^,,. ., («) II \ 

^ I ^ (-^1 +^2 ) 2^ ll/m - /m||L2(n) I 



rnelo melo 



^oJ^ + ix['''+X^T^A+'ix[''^ E ll/™-/™llL. + ^ll/-/1lL(n)- 



Therefore we have 

^11/ - /1li.(n) + yAi") II/'" - fnMk. + A^' E H/'" - f^nWk. < Op + (A^"' + a'"')^ 

melo melo V'^Aj 



This and $\lambda_1^{(n)}\sqrt{n} \to \infty$ give $\|\tilde f_{I_0} - f^*_{I_0}\|_{\mathcal{H}_{I_0}} \to 0$ in probability.

(Step 2) Next we show that the probability of $\tilde f = \hat f$ goes to 1. Since $\|\tilde f_{I_0} - f^*_{I_0}\|_{\mathcal{H}_{I_0}} \xrightarrow{p} 0$, we can assume that
$\|\tilde f_m\|_{\mathcal{H}_m} > 0$ ($m \in I_0$) without loss of generality. We identify $\tilde f$ as an element of $\mathcal{H}$ by setting $\tilde f_m = 0$ for
$m \in J_0$. Now we show that $\tilde f$ is also the minimizer of $F_n$, that is, $\tilde f = \hat f$, with high probability, hence $\hat I = I_0$
with high probability. By the KKT condition, the necessary and sufficient condition that $\tilde f$ also minimizes
$F_n$ is

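$$\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0} - f^*_{I_0}) - 2\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m} \le \lambda_1^{(n)} \quad (\forall m \in J_0), \qquad (53)$$
$$(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)(\tilde f_{I_0} - f^*_{I_0}) + \lambda_1^{(n)} D_n f^*_{I_0} + 2\lambda_2^{(n)} f^*_{I_0} - 2\hat\Sigma_{I_0,\epsilon} = 0, \qquad (54)$$

where $D_n = \mathrm{Diag}(\|\tilde f_m\|_{\mathcal{H}_m}^{-1})$. Note that (54) is satisfied (with high probability) because $\tilde f$ is the minimizer
of $\tilde F_n$ and $\|\tilde f_m\|_{\mathcal{H}_m} > 0$ for all $m \in I_0$ (with high probability). Therefore, if the condition (53) holds w.h.p.,
then $\tilde f = \hat f$ w.h.p.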
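The inactive-kernel KKT condition (53) can be checked numerically on data. Assuming the regression model $y_i = f^*(x_i) + \epsilon_i$ (an assumption of this illustration), the definitions of $\hat\Sigma_{m,I_0}$ and $\hat\Sigma_{m,\epsilon}$ give $2\hat\Sigma_{m,I_0}(\tilde f_{I_0} - f^*_{I_0}) - 2\hat\Sigma_{m,\epsilon} = \frac{2}{n}\sum_i r_i\, k_m(\cdot, x_i)$ with residuals $r_i = \tilde f(x_i) - y_i$, whose $\mathcal{H}_m$-norm is $\frac{2}{n}\sqrt{r^\top K_m r}$. The following Python sketch evaluates this quantity; the function names are hypothetical and the code is a sketch under the stated assumption, not the authors' procedure.

```python
import math


def inactive_kernel_gradient_norm(gram_m, residuals):
    """Return (2/n) * sqrt(r^T K_m r), the H_m-norm of the squared-loss
    gradient with respect to f_m, where r_i = f_tilde(x_i) - y_i."""
    n = len(residuals)
    Kr = [sum(gram_m[i][j] * residuals[j] for j in range(n)) for i in range(n)]
    quad = sum(r * v for r, v in zip(residuals, Kr))
    return 2.0 * math.sqrt(max(quad, 0.0)) / n  # guard against round-off


def satisfies_inactive_kkt(gram_m, residuals, lam1):
    """Check the inequality in (53) for one kernel m outside the support."""
    return inactive_kernel_gradient_norm(gram_m, residuals) <= lam1
```

If the check passes for every $m \in J_0$, setting $f_m = 0$ for those kernels is consistent with optimality of the restricted solution.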

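We will now show that the condition (53) holds w.h.p. Due to (54), we have

$$\tilde f_{I_0} - f^*_{I_0} = -(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)^{-1}\bigl[(\lambda_1^{(n)} D_n + 2\lambda_2^{(n)}) f^*_{I_0} - 2\hat\Sigma_{I_0,\epsilon}\bigr].$$

Therefore the LHS of (53), $\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0} - f^*_{I_0}) - 2\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m}$, can be evaluated as

$$\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0} - f^*_{I_0}) - 2\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m}$$
$$= \bigl\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)^{-1}(\lambda_1^{(n)} D_n + 2\lambda_2^{(n)}) f^*_{I_0}$$
$$\quad - 2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)^{-1} 2\hat\Sigma_{I_0,\epsilon} + 2\hat\Sigma_{m,\epsilon}\bigr\|_{\mathcal{H}_m}$$
$$\le \bigl\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)^{-1}(\lambda_1^{(n)} D_n + 2\lambda_2^{(n)}) f^*_{I_0}\bigr\|_{\mathcal{H}_m}$$
$$\quad + \bigl\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)^{-1} 2\hat\Sigma_{I_0,\epsilon} - 2\hat\Sigma_{m,\epsilon}\bigr\|_{\mathcal{H}_m}. \qquad (55)$$

We evaluate the probabilistic orders of the last two terms.
(i) (Bounding $B_{n,m} := \|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)^{-1} 2\hat\Sigma_{I_0,\epsilon} - 2\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m}$) We show that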



1 



Since O < ( ^^"-"M , we have 



'2E,„,,„+2A(")+a1")d„ 




The second inequality is due to the fact that for all (//^^ , fm) G V-if^um we have 



\ /E,,,,„+A("'+A("'i?„/2 -E,„ 



m,Io 



flo 



> 



because of O ^ ('^^'''■^" 
Thus we have 



^/oJo + ^2 ^ 2 



V -I- x^") 



2Em^yy2 + 2A2 ^ 



< Op(1/a/^). 



(56) 



Here the LHS of the above inequality is equivalent to 



+ 2Ar+Ari^«)-'S,„., + (E 
Therefore we observe 



E,„j„(2E,„,/„ + 2A^"^ + a(")aO"'E7„,, + |E,„,, 



1. 



= Op(l/xAI). 



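Since $\|\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m} = O_p(1/\sqrt{n})$, we also have

$$\|\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0} + 2\lambda_2^{(n)} + \lambda_1^{(n)} D_n)^{-1}\hat\Sigma_{I_0,\epsilon}\|_{\mathcal{H}_m} = O_p(1/\sqrt{n}).$$

This and $\|\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m} = O_p(1/\sqrt{n})$ yield

$$B_{n,m} = O_p(1/\sqrt{n}). \qquad (57)$$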



(ii) (Bounding E„,rn := ||2E,„,/„(2E,„,/„ + 2A^"^ + A^"^ AO^HaJ"' A. + 2A^"V;oll«^) Note that, due 
to 11/ - f*\\n ^ 0, we have D„ D, and we know that max„,„/ ||Em,m' - E„,„' = 
Op{^\og{M)/n) = Op(^) by Lemma[l0] Thus S,, := (2E/„,/„ - 2E/„,/J/Ai"^ + D - D,, satisfies 
5'„ = Op(l) and thus D — Sn h D/2 with high probabiUty. Hence 

2E„,,„(2E,„,,„ + 2A(") + A("'i?„)-i(Al"^i?„ + 2A("')/;o 

=2E„,j„(2E,„j„ + 2A^"' + A("'A.)-'(Al"'i?„ + 2A("')/;„ + Op (^-^ 

=2E„.,„(2E,„.,„ + 2A(") + A(")i?)-1(A(")Z?„ + 2A("V;„ + 
2E,„,,„(2S/„j„ + 2A("' + A(")i?)-iA("'5„(2E,„,,„ + 2A(") + x["\d S,,))-' (x["^ D,, + 2\^'^)fl 

+ 0J'). (58) 
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Here we obtain 

1 1 

<|!Sm,mKn.Jo(2^/oJo) ^/o ,mSm,m 1 1 =Op(l), 

and due to the fact that D — Sn h D /2 with high probability we have 

ii(s/o,/o + A^"' + \^;'\d 5„))-^(a(")a. + 2a("')/;„ii«.„ 

= + a'"' + x["\d - 5„))-^Diag(si„)(A(")i?„ + 2Xi'''>)9l\\n, 



<0,{\\VCiJ\nUjA''^ + >^r)) = Op(Ar + AD- 
Therefore the second term in the RHS of Eq. ( |5Ft is evaluated as 



An) 



An) 



(59) 



||E™,,„(2I],„j„ + 2A^"^ + Ai"^i?)-iAi"^5„(2E,„,,„ + 2A^"^ + X["\D^Sn)r'{XrD,, + 2A},"0/;J1«,„ 
<\\j:r,j,X2J:i„j,,+2Xi''^ +X^^''^Dri\\n^^nJ^^^ 

||(2E,„,,„ + 2A(") + x["\D^Sr.))-^\\n,,.nrjm„,io + A^"^ + A("\i?-5„))-^ (a1"> A. + 2A("V;„lk 
<0,(1 . (Ai") + Xi-Y^ ■ a("'o,(1) . (a(") + xi^Yi ■ (a("' + A^')) 

Therefore this and Eq. (|58] l give 





^ 2A<") 


f Ai" 


7^„)-1(A<"'a. 


+ 2a("V;o 




2SmJo(2S/o,/o - 


f- 2A<") 


f a(" 


i?)-i(A(")i?„4 


-2a("')/;„4 


o,(aH) + o,(-L) 


2S,„j,-,(2I]7„ j„ - 


F 2A("^ 


f Ai" 


i?)-i(A(")A.4 


-2a("))/;„4 


o,(a(")). 



Define 



An 



Sm,7o ^Io,Io + A 



An)' 



A := E™,,„(E,„,,„ + a("))-i ( ^ + 2^ /;,. 



An) 



-fo' 



We show \\An — A\\-u^ = Op(l). By the definition, we have 



A ~ A„ —J^rnJoi^IoJo + ^2 ) 



X["'^D 



{D-D„)fl. 



A 



(60) 



On the other hand, as in Eq. ( |56] l, we observe that 

2> ( S/o./o Al],„j„+A^"V 

\ 



Sm,7o(S/o,7o + A2 ) 



"H/q Um j'H/q Ut7i 



(61) 
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Moreover, since = Sm,™^*; (Vm), we have 



(n) ^1 



■Hi 



An)' 



Diag(E4,™) + 2-f- gl 



-Hi 



< 



Diag(E,?j,„i) 



<Op((Ai"^ + ArO- 



We can also bound the second term of ( l60l l as 



) < Op(Ai' 



(«) 2 - 



(62) 



Sm,7o j(, + A2' 



(") 



< 



< 



(ri) 



(") 



<2||p-i^„)/;„||^^^^ (•.•Eq.dSD) 

=Op(l). 

Therefore applying the inequalities Eq. ( |6T| l and Eq. ( |62t to Eq. ( |60] l, we have 

II A„ - A||«„^ = Op(A(")^) + Op(l) = Op(l). 
Hence we have E'n^m = A^""* || Ajl-^,^ +Op(A^"''). 

(iii) (Combining (i) and (ii)) Due to the above evaluations ((i) and (ii)), we have 



(63) 



max 



- max A 

va^J 



This yields 



(") 



s™,/„(E,„,,„ + a("))-i (^i? + 2^j /;„ 



A 



+ o,(A("))<A(")(l-r?) + Op(A(")). 



P { ||2I],„j„(/,„ - /;j - 2E„,,J«„ > A^"\Vm G J, 



Thus the probability of the condition ( |53] l goes to 1 . ■ 

Proof: (TheoremlD First we prove that X^"^ ^/n ^ oo is a necessary condition for / A /q. Assume that 
$\liminf \lambda_1^{(n)}\sqrt{n} < \infty$. Then we can take a subsequence that converges to a finite value; therefore, by passing
to the subsequence if necessary, we can assume $\lambda_1^{(n)}\sqrt{n} \to \mu_1$ without loss of generality. We will derive
a contradiction under the conditions $\|\hat f - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$ and $\hat I \xrightarrow{p} I_0$. Suppose $\hat I = I_0$.
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By the KKT condition, 

= 2(S/o_/o/7o - - S/ojoZ/J + x["^Dnfi„ + 2A^"\//o 

2{ti„j„ + a("))(/;„ - A„) = a(")aJ;„ + 2A^"V;„ - 2S/,,. (64) 
=^ 2VH(S,„.,„ + A^"))(/;.., - A„) = V^.xi'^^Dfl + ^/H2A("V;„ - 2^A^E/o,. 

+ (2V^(E/„,/„ - S/„/J(/;„ - //„) + V^x[''\Dn - D)fl) 
=> 2V^{Ei„j„ + a("')(/*, - 4) = ^i,Dfl + V^2A("V/; - 2V^E,,,,, + Op(l), (65) 

where the last inequality is due to V"-^!"' ^ /^i. llAi - D\\-Hj^^^-h,„ = Op(l), ||/ - f*\\-H = Op(l) and 

\\'Eigjg~Y.igjg\\-H,^^,-Hj^^ = Op(l). Moreover since the second equality (|64|| indicates that Opfl) + Op(A2"^ ) = 

a(")d/;^ + 2A^"V;„ + we have A^") = 0^(1) and A^") = Op(l). 

We now show that the KKT condition under which $\hat f$ satisfying $\hat I = I_0$ is optimal with respect to $F_n$ is
violated with strictly positive probability:

$$\liminf P\bigl(\exists m \in J_0,\ \|2(\hat\Sigma_{m,I_0}\hat f_{I_0} - \hat\Sigma_{m,I_0} f^*_{I_0} - \hat\Sigma_{m,\epsilon})\|_{\mathcal{H}_m} > \lambda_1^{(n)}\bigr) > 0. \qquad (66)$$

Obviously this indicates that the probability of $\hat I = I_0$ does not converge to 1, which is a contradiction.
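For all $v_m \in \mathcal{H}_m$ ($m \in J_0$), there exists $w_{I_0} \in \mathcal{H}_{I_0}$ such that

$$\Sigma_{I_0,m} v_m = (\Sigma_{I_0,I_0} + \lambda_2^{(n)}) w_{I_0}. \qquad (67)$$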

Note that wj^ is uniformly bounded for all Aj"^ > because the range of S/,, is included in the range 
of S/ojo jBakeii 119731) and there exists w/^ such that S/o,mWm = ^io,io''^io (^^o independent of Aj"'), 
hence ^.i^j^wi^ = (S/o,/o + A2"'')^'/o' and 

\\wio\\-H,„ < \/(m)/o,S/o,/o(5^/o,/o +4"V^S/oJoW'/o)«Jo ^ l|w/oll«/o 

for Aj"^ > and ||it'/o ll'H/o ^ ll''^^ll?^/o '^z"^ — 0- Let Vm <= "Hm be any non-zero element such that 
'^m%.Vm ^ and wj^ be satisfying the above equahty (l67T l. then 

==Vn{Vm,^m,e)-H„, + {Vm,%n,IoVn{fj^^ - flo))n,„ 

= Vn{v„i, Sm,e>«,„ + {Vm,^m,lVn(fjg - fla))Hrr, + Op{l) 

= Vn{v„i, tm,c)u„, + {wi„,i^IoJo + A2"')V"(//o - flo))u„, + Op(l) 

where we used ||S,ri,/o ~ ^mjo ll'Hm.'H/,, = Op(1/y^) and j|/* — f\\-u -'^ in the second equality, and the 

relation i65[ in the last equa lity. We can show that Z„ \/n{vm, ^m.e) — \/n{wig, f^ig^e) has a positive 
variance as follows (see also lBactil (l2008l) ): 

E[Z„] = 0, 

Vm) - 2(w,„, ■Wlo)) 
= (T^ {(v„i, Y.m,mVm) - {Vm, S,„,/ol^/o) + ('.■ A^"-* = Op(l)) 

= '^^(Sit(,L«m, - Kn,/o"^/oJo^^O:m)Sm,m^^m) + Op(l), 

where Vf^j^ = Diag(i;,\0„)(i;/o j„ + A^"-')"^Diag(Eit{L) (note that Vf^ is invertible because Vi^j^ < 
Vigjo and Vigjo is invertible). Now since ^ F/q./o and I-h,,, - VrnJoVf^jVi^.m >- O (this is because 

l^/oUm,/oUm = {vif^ \'°) invertible), we have - Kn,/o V};,/o^/o,m ^ O. Therefore by the 

central limit theorem Z„ converges Gaussian random variable with strictly positive variance in distribution. 
Thus the probabihty of 

I > A/'li;„i||«,„ 
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is asymptotically strictly positive because A^"''^/n — > fii (Note that this is true whether y^X^^ converges 
to finite value or not). This yields (|66] |, i.e. / does not satisfy / = /q with asymptotically strictly positive 
probability. 

We refer to the following as Condition A:

Condition A: $\lambda_1^{(n)}\sqrt{n} \to \infty$.

Now that we have proven A^""* -y/n — > oo, we are ready to prove the assertion (IT&I . Suppose the condition 
(O is not satisfied for any sequences A^"' , 0, that is, there exists a constant £. > such that 



lim sup 



for any sequences Aj"\ A2"'' — > satisfying Condition A (X']^'y/n — !• 00). Fix arbitrary sequences 



9io 



>{l + 0, (3meJo) 



(68) 



(") 



A^"' , A^"^ satisfying Condition A. If / = /„, the KKT condition 

\\2±rn.i„{fi„ - fl) - 2Sm,.ll«„. < A^"' (Vto G Jo), 



(") , 



An) 



(69) 
(70) 



should be satisfied (see (53) and (54)). We prove that the first inequality (69) of the KKT condition is violated
with strictly positive probability under the assumptions and the condition (70). We have shown that (see (55))



A^"' {2T,„i,i„{fi„ - fi„) - 2i;„,e) 



X(") 



2I]„,,„(2S,„.,„ + 2Ar^ + A^^AO^nA. + 2^)/;„ 

7(^S„i_/g(2S/o j(, + 2A^ ^ + A^ ^Ai) ^2S/„,e + -;7;;jS„ 



A 



A 



(")■ 



(71) 



^1 ^1 
As shown in the proof of Theorem 1, the first term can be approximated by 



more precisely Eq. (I63t gives 



^-fo Jo + A 



(n) -^1 i^ri 



.A, 



A 



.A 



(«)- 



A 



AO. 

Since liminf„ 
that 



S™.7o(S/o Jo + A^"VM ^ + 2ff;rT ) .gl„ 



> (1 + ^) by the assumption, we observe 



2S,„,,„(2S,„,,„ + 2A^") + A^") A.)-^ ( A + 2^ I 



'x\ 



Now since A^^"' ^/n — > 00, we have proven that 



S™,/„(2E/„,/„ + 2A^"^ + XrDn)-'2^i, 



Ai' 



Ai 



(")■ 



>(l+0 />o. 



= Op(l/(Al"'V^)) =Op(l), (73) 



(72) 



in the proof of Theorem 1 (Eq. (57)). Therefore, combining (71), (72), and (73), we have observed that the
KKT condition (53) is violated with strictly positive probability if the condition (68) is satisfied. This yields
the irrepresenter condition (IT&t is a necessary condition for the consistency of elastic-net MKL. ■ 



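Lemma 10 If $\sup_X k_m(X, X) \le 1$ and $\sup_X k_{m'}(X, X) \le 1$, then

$$P\bigl(\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} \ge E[\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}}] + \varepsilon\bigr) \le \exp(-n\varepsilon^2/2). \qquad (74)$$

In particular,

$$P\Bigl(\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} \ge \sqrt{\tfrac{1}{n}} + \varepsilon\Bigr) \le \exp(-n\varepsilon^2/2). \qquad (75)$$

Proof: We use McDiarmid's inequality (Devroye et al., 1996). By definition,

$$\hat\Sigma_{m,m'} = \frac{1}{n}\sum_{i=1}^n k_m(\cdot, x_i)\, k_{m'}(x_i, \cdot).$$

We denote by $\tilde\Sigma_{m,m'}$ the empirical cross covariance operator with samples
$(x_1, \dots, x_{j-1}, \tilde x_j, x_{j+1}, \dots, x_n)$, where the $j$-th sample $x_j$ is replaced by $\tilde x_j$, drawn independently from
the same distribution as the $x_j$'s.

By the triangle inequality, we have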

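$$\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} - \|\tilde\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} \le \|\hat\Sigma_{m,m'} - \tilde\Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}}.$$

Now the RHS can be evaluated as follows:

$$\|\hat\Sigma_{m,m'} - \tilde\Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} = \Bigl\|\frac{1}{n}\bigl(k_m(\cdot, x_j) k_{m'}(x_j, \cdot) - k_m(\cdot, \tilde x_j) k_{m'}(\tilde x_j, \cdot)\bigr)\Bigr\|_{\mathcal{H}_m, \mathcal{H}_{m'}}. \qquad (76)$$

The RHS of (76) can be further evaluated as

$$\Bigl\|\frac{1}{n}\bigl(k_m(\cdot, x_j) k_{m'}(x_j, \cdot) - k_m(\cdot, \tilde x_j) k_{m'}(\tilde x_j, \cdot)\bigr)\Bigr\|_{\mathcal{H}_m, \mathcal{H}_{m'}}$$
$$\le \frac{1}{n}\bigl(\|k_m(\cdot, x_j) k_{m'}(x_j, \cdot)\|_{\mathcal{H}_m, \mathcal{H}_{m'}} + \|k_m(\cdot, \tilde x_j) k_{m'}(\tilde x_j, \cdot)\|_{\mathcal{H}_m, \mathcal{H}_{m'}}\bigr)$$
$$\le \frac{1}{n}\bigl(\|k_m(\cdot, x_j)\|_{\mathcal{H}_m}\|k_{m'}(x_j, \cdot)\|_{\mathcal{H}_{m'}} + \|k_m(\cdot, \tilde x_j)\|_{\mathcal{H}_m}\|k_{m'}(\tilde x_j, \cdot)\|_{\mathcal{H}_{m'}}\bigr)$$
$$\le \frac{2}{n}, \qquad (77)$$

where we used $\|k_m(\cdot, x_j)\|_{\mathcal{H}_m} = \sqrt{\langle k_m(\cdot, x_j), k_m(\cdot, x_j)\rangle_{\mathcal{H}_m}} = \sqrt{k_m(x_j, x_j)}$. Bounding the norm in (76)
by (77), we have

$$\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} - \|\tilde\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} \le \frac{2}{n}.$$

By symmetry, exchanging $\hat\Sigma$ and $\tilde\Sigma$ gives

$$\bigl|\,\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} - \|\tilde\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}}\bigr| \le \frac{2}{n}.$$

Therefore, by McDiarmid's inequality, we obtain

$$P\bigl(\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}} - E[\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}}] \ge \varepsilon\bigr) \le \exp\Bigl(-\frac{2\varepsilon^2}{n(2/n)^2}\Bigr) = \exp\Bigl(-\frac{n\varepsilon^2}{2}\Bigr).$$

This gives the first assertion, Eq. (74).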
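The arithmetic behind the exponent can be checked directly. McDiarmid's inequality bounds the upper tail by $\exp(-2\varepsilon^2/\sum_j c_j^2)$ when replacing the $j$-th sample changes the statistic by at most $c_j$; with $c_j = 2/n$ for all $n$ samples this equals $\exp(-n\varepsilon^2/2)$. The short Python sketch below (function name chosen for this illustration only) evaluates the general bound so the specialization can be verified numerically.

```python
import math


def mcdiarmid_tail_bound(eps, bounded_differences):
    """Upper bound exp(-2 eps^2 / sum_j c_j^2) on P(g - E[g] >= eps)
    for a function g with bounded differences c_1, ..., c_n."""
    return math.exp(-2.0 * eps ** 2 / sum(c * c for c in bounded_differences))


# Specialization used for the cross covariance operator: c_j = 2/n.
n, eps = 100, 0.3
bound = mcdiarmid_tail_bound(eps, [2.0 / n] * n)
```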

To show the second assertion (Eq. (75)), first we note that

$$E[\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}}] \le \sqrt{E[\|\hat\Sigma_{m,m'} - \Sigma_{m,m'}\|_{\mathcal{H}_m, \mathcal{H}_{m'}}^2]}
\le \sqrt{E[\|(\hat\Sigma_{m,m'} - \Sigma_{m,m'})(\hat\Sigma_{m',m} - \Sigma_{m',m})\|_{\mathrm{tr}}]}, \qquad (78)$$

where $\|\cdot\|_{\mathrm{tr}}$ is the trace norm, and the last inequality holds because the operator norm is bounded by the trace norm. As in Lemma 1 of Gretton et al. (2005), we see that

$$\|(\hat\Sigma_{m,m'} - \Sigma_{m,m'})(\hat\Sigma_{m',m} - \Sigma_{m',m})\|_{\mathrm{tr}}
= \frac{1}{n^2}\sum_{i,j=1}^n k_m(x_j, x_i) k_{m'}(x_i, x_j) - \frac{2}{n}\sum_{i=1}^n E_X[k_m(X, x_i) k_{m'}(x_i, X)] + E_{X,X'}[k_m(X', X) k_{m'}(X, X')],$$

where $X$ and $X'$ are independent random variables distributed according to $\Pi$. Thus

$$E[\|(\hat\Sigma_{m,m'} - \Sigma_{m,m'})(\hat\Sigma_{m',m} - \Sigma_{m',m})\|_{\mathrm{tr}}]$$
$$= \frac{1}{n} E_X[k_m(X, X) k_{m'}(X, X)] + \frac{n(n-1)}{n^2} E_{X,X'}[k_m(X', X) k_{m'}(X, X')]$$
$$\quad - 2E_{X,X'}[k_m(X', X) k_{m'}(X, X')] + E_{X,X'}[k_m(X', X) k_{m'}(X, X')]$$
$$= \frac{1}{n} E_X[k_m(X, X) k_{m'}(X, X)] - \frac{1}{n} E_{X,X'}[k_m(X', X) k_{m'}(X, X')] \le \frac{1}{n}.$$

This and Eq. (78), together with the first assertion (Eq. (74)), give the second assertion. ■



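Lemma 11 If $E[\epsilon^2 \mid X] \le \sigma^2$ almost surely and $\sup_X k_m(X, X) \le 1$, then we have

$$\|\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m} = O_p(\sigma/\sqrt{n}).$$

Proof: By definition, we have

$$E[\|\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m}] \le \sqrt{E[\|\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m}^2]} = \sqrt{E\Bigl[\frac{1}{n^2}\sum_{i=1}^n \epsilon_i^2 k_m(x_i, x_i)\Bigr]} \le \frac{\sigma}{\sqrt{n}}. \qquad (79)$$

Applying Markov's inequality, we obtain the assertion. ■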
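The $\sigma/\sqrt{n}$ scaling in Lemma 11 can be sanity-checked by simulation. For fixed sample points, $\|\hat\Sigma_{m,\epsilon}\|_{\mathcal{H}_m}^2 = \frac{1}{n^2}\epsilon^\top K_m \epsilon$, so the norm is computable from the Gram matrix. The Python sketch below is only an illustration: the Gaussian kernel (which satisfies $k(x,x) = 1$), the uniform design on $[0,1]$, the independent zero-mean Gaussian noise, and all names are assumptions of this example rather than part of the lemma.

```python
import math
import random


def noise_operator_norm(gram, eps):
    """||Sigma_hat_{m,eps}||_{H_m} = (1/n) * sqrt(eps^T K_m eps)."""
    n = len(eps)
    quad = sum(eps[i] * gram[i][j] * eps[j] for i in range(n) for j in range(n))
    return math.sqrt(max(quad, 0.0)) / n  # guard against round-off


def mean_noise_norm(n, sigma, trials, seed=0):
    """Monte Carlo estimate of E||Sigma_hat_{m,eps}|| for a fixed design."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    # Gaussian kernel with k(x, x) = 1, so sup_x k(x, x) <= 1 holds.
    gram = [[math.exp(-(a - b) ** 2) for b in xs] for a in xs]
    total = 0.0
    for _ in range(trials):
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += noise_operator_norm(gram, eps)
    return total / trials
```

By Jensen's inequality the Monte Carlo mean should fall below $\sigma/\sqrt{n}$, in line with the bound on the expectation used in the proof.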

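Proposition 1 (Bernstein's inequality in Hilbert spaces) Let $(\Omega, \mathcal{A}, P)$ be a probability space, $H$ be a separable
Hilbert space, $B > 0$, and $\sigma > 0$. Furthermore, let $\xi_1, \dots, \xi_n : \Omega \to H$ be independent random
variables satisfying $E[\xi_i] = 0$, $\|\xi_i\|_H \le B$, and $E[\|\xi_i\|_H^2] \le \sigma^2$ for all $i = 1, \dots, n$. Then we have

$$P\Bigl(\Bigl\|\frac{1}{n}\sum_{i=1}^n \xi_i\Bigr\|_H \ge \sqrt{\frac{2\sigma^2\tau}{n}} + \sqrt{\frac{\sigma^2}{n}} + \frac{2B\tau}{3n}\Bigr) \le e^{-\tau} \qquad (\tau > 0).$$

Proof: See Theorem 6.14 of Steinwart (2008).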
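As a numerical illustration of Bernstein's inequality in Hilbert spaces, the Python sketch below takes $H = \mathbb{R}^d$ and draws $\xi_i$ uniformly from the sphere of radius $B$, so that $E[\xi_i] = 0$, $\|\xi_i\| = B$, and $E\|\xi_i\|^2 = B^2$ (hence $\sigma = B$). It assumes the threshold has the form $\sqrt{2\sigma^2\tau/n} + \sqrt{\sigma^2/n} + 2B\tau/(3n)$ as in Theorem 6.14 of Steinwart (2008); the distribution, the dimension, and all function names are choices made for this example only.

```python
import math
import random


def bernstein_threshold(n, B, sigma, tau):
    """Deviation level sqrt(2 sigma^2 tau / n) + sqrt(sigma^2 / n) + 2 B tau / (3 n)."""
    return (math.sqrt(2.0 * sigma ** 2 * tau / n)
            + math.sqrt(sigma ** 2 / n)
            + 2.0 * B * tau / (3.0 * n))


def sample_mean_norm(n, dim, B, rng):
    """Norm of the average of n points drawn uniformly from the radius-B sphere."""
    total = [0.0] * dim
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        for k in range(dim):
            total[k] += B * g[k] / norm
    return math.sqrt(sum(v * v for v in total)) / n


def tail_frequency(n, dim, B, tau, trials, seed=0):
    """Empirical frequency with which the mean exceeds the Bernstein threshold."""
    rng = random.Random(seed)
    thr = bernstein_threshold(n, B, B, tau)
    hits = sum(sample_mean_norm(n, dim, B, rng) >= thr for _ in range(trials))
    return hits / trials
```

The observed exceedance frequency should not exceed $e^{-\tau}$; in this setting it is typically far smaller, since the bound is not tight for such well-behaved variables.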
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